
Sensor fusion for 3D object detection

Updated: May 26, 2021

A state-of-the-art model based on the fusion of RGB image data and LiDAR measurements.



Object detection and segmentation are two key elements that allow an autonomous driving system to understand the surrounding environment. In general, this perception task is carried out by exploiting data retrieved from different types of sensors rather than a single one.


Merging information captured in different domains increases data redundancy, with each sensor confirming or supplementing the insights provided by the others.

An additional aspect to take into consideration in the self-driving scenario is the limited amount of computational resources available on board. A methodology addressing the perception problem therefore has to be relatively lightweight and run with low latency.


LaserNet++


As demonstrated by the reported results, the LaserNet++ model introduced by Meyer et al. [1] is a valid solution for reaching these objectives without violating the imposed resource constraints. Specifically, the model extends the LaserNet LiDAR-based object detector by fusing the information captured by one or more cameras with the LiDAR data. Since object detection and segmentation are performed simultaneously by the same network, the model lowers the overall computational and time requirements, which would be remarkably higher with two independent architectures.


High-resolution images can enhance the performance of a model when combined with LiDAR measurements, since the latter become increasingly sparse as the distance grows. RGB images also provide texture and colour information that cannot be directly extracted from LiDAR data. LiDAR sensors, on the other hand, are not affected by the camera's well-known sensitivity to ambient light conditions.


In the literature, methods based on LiDAR inputs can be mainly categorised into two classes: Bird's Eye View (BEV) methods and native Range View (RV) methods. Whereas the former preserve the physical size of objects regardless of distance, usually leading to better performance, the latter operate on a compact representation of the LiDAR data, resulting in more efficient and less expensive implementations.


LaserNet++ stands out among contemporary studies on image and LiDAR data fusion for its computational efficiency and low latency. The model receives as input an RGB image aligned with a five-channel LiDAR image. Considering a LiDAR sensor with a set of n lasers, the LiDAR image is built by mapping lasers to rows and discretised azimuth angles to columns. For every cell containing a measurement, the five channels describe the point's range, height, azimuth angle, intensity and whether the cell is occupied. A high-level overview of the model is displayed in the following picture.
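
As an illustration of this input format, the snippet below builds such a five-channel image from a generic LiDAR sweep. It is only a minimal sketch: the sensor-specific details (number of lasers, azimuth resolution, horizontal field of view, handling of multiple returns per cell) are illustrative assumptions and not taken from the paper.

import numpy as np

def build_lidar_image(points, ring, n_lasers=64, n_cols=512,
                      fov=(-np.pi / 4, np.pi / 4)):
    """Map a LiDAR sweep to a five-channel range-view image.

    points : (N, 4) array with columns x, y, z, intensity
    ring   : (N,) integer laser index of each point (0 .. n_lasers - 1)
    """
    x, y, z, intensity = points.T
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)     # range of each point
    azimuth = np.arctan2(y, x)                  # horizontal angle

    # Keep only the points falling inside the chosen horizontal field of view.
    mask = (azimuth >= fov[0]) & (azimuth < fov[1])
    cols = ((azimuth[mask] - fov[0]) / (fov[1] - fov[0]) * n_cols).astype(int)
    cols = np.clip(cols, 0, n_cols - 1)
    rows = ring[mask]

    # Channels: range, height, azimuth angle, intensity, occupancy flag.
    image = np.zeros((5, n_lasers, n_cols), dtype=np.float32)
    image[0, rows, cols] = rng[mask]
    image[1, rows, cols] = z[mask]
    image[2, rows, cols] = azimuth[mask]
    image[3, rows, cols] = intensity[mask]
    image[4, rows, cols] = 1.0                  # the cell is occupied
    return image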



As noticeable, as soon as the RGB image features are extracted through a Convolutional Neural Network (CNN), they are projected into the LiDAR image. The authors indeed found that merging high-level representations of the RGB images with the LiDAR data, rather than the raw pixels, leads to better results. The mapping needed to copy the RGB features into the LiDAR image is obtained by first projecting the LiDAR points p onto the RGB image.
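
The original post shows this projection as a formula rendered as an image. A standard pinhole-camera formulation of a LiDAR-to-camera projection of this kind (an assumption made here, not necessarily the exact notation used in the paper) is

[u', v', w']^T = K (R p + t),    (u, v) = (u'/w', v'/w')

where R and t are the rotation and translation from the LiDAR frame to the camera frame and K is the camera intrinsic matrix. The pixel (u, v) indicates which location of the RGB feature map is copied into the LiDAR image cell holding the point p.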

The CNN comprises three consecutive ResNet blocks, each consisting of 8 convolutional layers alternating with skip connections.



In brief, CNNs are architectures particularly suited to working on images, since they are capable of extracting meaningful features while preserving the spatial structure of the input. Specifically, in the convolutional layers from which the network derives its name, one or more filters slide across the entire input representation, returning at each position the product between the filter weights and the covered pixels. Skip connections are alternative paths through which low-level information, such as the underlying shared image structure, can flow around bottlenecks in the network. During backpropagation, they also help to mitigate the vanishing gradient problem.

Functioning of a convolutional layer with four 3×3 filters
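
As a concrete counterpart of the figure above, the short PyTorch snippet below applies a convolutional layer with four 3×3 filters to a dummy RGB image; the tensor sizes are purely illustrative.

import torch
import torch.nn as nn

# A convolutional layer with 4 filters (output channels) of size 3x3,
# applied to a 3-channel RGB input; padding=1 preserves the spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1)

image = torch.randn(1, 3, 64, 64)   # dummy batch: one 64x64 RGB image
features = conv(image)
print(features.shape)               # torch.Size([1, 4, 64, 64])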

Whereas the filter size of the convolutional layers remains the same across the entire CNN, the number of filters (or kernels) adopted is 16, 24 and 32 for the first, second and third ResNet block, respectively. The RGB image features, once extracted and warped into the LiDAR image, are then concatenated with the high-level representations obtained by applying an analogous CNN to the LiDAR data. The result is finally fed into the LaserNet network, which performs the object detection and segmentation tasks.
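
A minimal sketch of this fusion step in PyTorch is shown below. It is not the authors' implementation: the exact layout of the residual units, the 1x1 projection used when the channel count changes, and the way the precomputed pixel mapping is represented (here a hypothetical tensor of (row, column) indices into the RGB feature map for every LiDAR cell) are assumptions made purely for illustration.

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3x3 convolutions wrapped by a skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # 1x1 projection on the skip path when the channel count changes (an assumption here).
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return torch.relu(self.skip(x) + self.body(x))

def resnet_block(in_ch, out_ch, n_units=4):
    """A block of 8 3x3 convolutional layers: 4 residual units of 2 convolutions each."""
    units = [ResidualUnit(in_ch, out_ch)]
    units += [ResidualUnit(out_ch, out_ch) for _ in range(n_units - 1)]
    return nn.Sequential(*units)

class FeatureExtractor(nn.Module):
    """Three consecutive ResNet blocks with 16, 24 and 32 filters."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            resnet_block(in_channels, 16),
            resnet_block(16, 24),
            resnet_block(24, 32),
        )

    def forward(self, x):
        return self.net(x)

def warp_rgb_features(rgb_feat, pixel_map, valid):
    """Copy, for every LiDAR cell, the RGB feature of the pixel its point projects to.

    rgb_feat : (C, H, W) RGB feature map
    pixel_map: (2, h, w) precomputed (row, column) indices into rgb_feat per LiDAR cell
    valid    : (h, w) mask of LiDAR cells whose point falls inside the camera image
    """
    rows, cols = pixel_map[0], pixel_map[1]
    warped = rgb_feat[:, rows, cols]            # (C, h, w)
    return warped * valid                       # zero out cells without a valid projection

# Hypothetical usage with dummy tensors (all shapes are illustrative only).
rgb_cnn   = FeatureExtractor(in_channels=3)     # runs on the RGB image
lidar_cnn = FeatureExtractor(in_channels=5)     # runs on the five-channel LiDAR image

rgb_feat   = rgb_cnn(torch.randn(1, 3, 128, 256))[0]    # (32, 128, 256)
lidar_feat = lidar_cnn(torch.randn(1, 5, 64, 512))[0]   # (32, 64, 512)

pixel_map = torch.randint(0, 128, (2, 64, 512))          # stand-in projection indices
valid     = torch.ones(64, 512)

fused = torch.cat([lidar_feat, warp_rgb_features(rgb_feat, pixel_map, valid)], dim=0)
# 'fused' (64 channels) is what would then be passed to the LaserNet backbone.

Concatenating along the channel dimension keeps the LiDAR image resolution unchanged, so the fused tensor can be consumed directly by the downstream detector.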


According to the results reported in the paper, the model achieves state-of-the-art performance, demonstrating the improvement brought by the addition of the RGB image input. Its outcomes are also superior to those of LaserNet alone, especially for distant and small objects.




References

  1. Meyer, Gregory P., et al. "Sensor fusion for joint 3d object detection and semantic segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2019.


The images in the blog are either copyright free or designed from scratch. Some illustrations are inspired by and partially derived from the images presented in the referenced paper.
