Fusion of an RGB camera and LiDAR sensor through a Graph CNN for 3D object detection

Jinsol Choi; Minwoo Shin; Joonki Paik; Joonki Paik

doi:10.1364/OPTCON.479777

1. Introduction

Recently, various developments have been made in the visual recognition task due to the development of deep learning and data acquisition technologies [1–5]. If the image acquired from the optical camera or the point cloud acquired from the LiDAR is processed independently, it can cause serious problems in the perception task. For example, when some objects are covered in complex scenes, or in extreme weather such as fog or heavy rain, data cannot be acquired well, so there is a limit to accurate perception. In order to compensate for the shortcomings of such a single sensor, various sensor fusion technologies are attracting a lot of attention.

Camera and LiDAR used in three-dimensional (3D) computer vision have their own strengths and weaknesses for sensor fusion. The image acquired from the camera has a grid-like structure and the data format is regular and dense, so semantic information is richer than that of the point cloud. However, two-dimensional (2D) camera images have limitations in obtaining accurate results in fields that require depth information or structural information compared to point clouds that naturally contain 3D information without projection. On the other hand, point cloud acquired by LiDAR has useful geometrical information for understanding 3D scenes, so it is usefully used in many fields such as 3D object detection, semantic segmentation, and object tracking. However, because the data is usually sparse and unevenly distributed, it does not represent distant or small objects well. As such, due to the different advantages and disadvantages of image and point cloud, there are active researches to use the complementary relationship by fusion of camera and LiDAR data in the field of sensor fusion. However, since camera and LiDAR have different structural data characteristics, features obtained from each sensor are expressed in different contexts, so it is still difficult to fuse the two without loss of information.

To solve this problem, we propose a graph feature fusion (GFF) module that uses a Graph CNN to minimize the loss of semantic and structural information that occurs when image features and point cloud features are fused. The GFF module fuses local features extracted from images and point clouds using a Graph Convolution Neural Network, then adds relationship information between each feature, and creates a powerful point-image fusion feature in scale and geometry. We perform 3D object detection as a downstream task to evaluate the contribution of feature fusion in the proposed method. The proposed method (Fig. 1) takes an image and a raw point cloud as input, and uses EPNet [6] with improved 3D object detection performance as a framework to fuse two sensor information in the feature stage.

The main contributions of the proposed study are summarized as follows.

(1) We propose a GFF module that fuses each local feature obtained through Camera and LiDAR, adds relationship information between each feature, and creates a powerful fusion feature in scale and geometry.
(2) To compensate for the sparseness of the point cloud for distant objects, the camera image provides semantic information. In the GFF module, point-based Graph CNN is used for point cloud features and image features to converge without loss of information of each feature. This makes it possible to take advantage of the complementary characteristics of the two sensors.

This paper is organized as follows. Section 2 introduces existing works related to the proposed method, Section 3 summarizes the background of the proposed techniques, and Section 4 describes the proposed method in detail. First, Point Stream in RPN, which is a point cloud feature extraction step, and Image Stream in RPN, which performs image feature extraction, are explained, and the Graph Feature Fusion (GFF) module that fuses each features(point cloud feature, image feature) is explained. After Section 5 evaluates the performance of the proposed method by comparing it with various 3D object detection using image and point cloud methods on the KITTI dataset, Section 6 concludes the paper.

Fig. 1. Flow chart of the proposed method. The model takes in both a point cloud and an image as inputs, and then extracts features from each input separately. The Graph Feature Fusion (GFF) module is used to fuse the extracted features, and the resulting fusion feature is added to the point cloud feature.

Download Full Size | PDF

2. Related works

2.1 Deep learning methods for the 3D point cloud

2.1 Point-based methods

The point-based multi-layer perceptron (MLP) method applies a pooling function after sharing layers to aggregate global features. A general deep learning method cannot be directly applied due to the irregularity of the point cloud. To overcome these limitations, Qi et al. proposed a pioneering deep learning-based point cloud processing method called PointNet, which takes a raw point cloud as a direct input using a symmetric function to consider permutation invariance characteristics [7]. More specifically, it learns point-wise features independently learned through multiple MLP layers and extracts global features using max pooling.

However, since PointNet learns each point cloud independently, local structural information cannot be obtained. Therefore, Qi et al. also proposed an improved version called PointNet++, which maintains the original geometrical information as much as possible by obtaining global features in a hierarchical manner using the original PointNet to obtain the geometrical structure in the neighborhood of each point [8].

2.1 Graph-based methods

A graph-based deep neural network (DNN) treats each point in the point cloud as a node in the graph and creates edges of the graph based on the neighbors of each point. As a pioneering work, Simonovsky et al. considered each point as a node in the graph and connected each node to all its neighbors with edges [9]. Based on Simonovsky’s method, we applied edge-conditioned convolution (ECC) using MLP. It uses max pooling to aggregate neighborhood information and implements a graph-based on voxel-grid. Phan et al. proposed a deep graph convolutional neural network called DGCNN, which constructs a graph in the feature space, and dynamically updates the graph for each layer of the network [10]. The proposed edge-based convolution method, called EdgeConcv, is applied to the graph calculated at each layer and aggregates local context information by considering a number of points close to each center point.

2.2 Fusion of the image and point cloud

2.2 Early fusion

Early fusion combines data-level LiDAR data and data- or feature-level image data. It fuses data from different sensors through spatial alignment and projection at the raw data level of LiDAR. Liang et al. proposed PointPainting [11], and Xie et al. proposed PI-RCNN [12] to improve the object detection performance by fusing the semantic segmentation feature of the image to the raw LiDAR point cloud. Another approach is to convert the raw point cloud into a voxelized tensor and then fuse the semantic segmentation features of the image [13,14]. Meyer et al. converted a 3D LiDAR point cloud converted into a 2D image, and then fused the features of converted and real 2D images using CNN [15]. However, the performance of the early fusion method is limited by the pretrained 2D image semantic segmentation network.

2.2 Late fusion

Late fusion is a method of fusing the results of each sensor pipeline. Pang et al. proposed a late fusion method, called CLOCs, which utilizes detection results of both the LiDAR point cloud and camera image, and then predicts the final 3D bounding box using the results [16]. Asvadi et al. proposed a 2D object detection methods, which produces the IoU score by combining the proposals obtained from the LiDAR point cloud and the camera image [17]. However, the performance of the late fusion method is limited by the pipeline performance of each sensor in a way that the task of different sensors is performed and then combined.

2.2 Deep fusion

Deep fusion combines feature-level data of LiDAR and data- or feature-level camera data. Liang et al. proposed a continuous fusion layer that projects image features to the bird’s-eye view (BEV) space, and then fuses them with point cloud features [18]. It fuses images of different resolutions and LiDAR through a LiDAR-based convolution layer. Yoo et al. and Huang et al. also proposed deep fusion methods that acquire the features of the camera image and LiDAR point cloud, respectively, and fuse them in the feature space [6,19].

2.3 3D object detection using fusion of the image and point cloud

Various fusion methods have been proposed to utilize the advantages of both camera image data and LiDAR point cloud data in the field of 3D object detection [11,19–25]. Chen et al. proposed a multi-view 3D object detection network called MV3D [20], and Ku et al. proposed joint 3D proposal generation method for 3D object detection called AVOD [21]. Both methods refine the bounding box by fusing the BEV and camera feature maps for each ROI region. These multi-view methods generally have better performance than the single-view method, they lose geometrical structure information in the process of converting the point cloud to a specific view, which finally results in the detection performance degradation.

To solve this problem, Sindagi et al. proposed a multimodal voxel net called MVX-Net, which improves the representation of voxel features to semantic image features by fusing camera images with LiDAR point cloud features in the early stages [22]. Yoo et al. also proposed a cross-view spatial feature fusion method called 3D-CVF, which effectively fuses the spatial features of the camera image and LiDAR point cloud by utilizing cross-view spatial feature fusion [19]. However, 3D-CVF has a limitation in establishing an accurate correspondence between the camera image and the LiDAR point cloud because geometrical structure information of the point cloud is lost due to the gridation process. Huang et al. proposed a point feature enhancement method called EPNet using deep fusion between point cloud feature extraction and image feature extraction to improve the semantic information of point features to improve the performance of the object detector [6]. However, EPNet uses only fully-connected (FC) layers, which cannot completely represent fused features due to low capacity. In this paper, we used the EPNet as a baseline for deep fusion of point features and image features. We also proposed a fusion module that complements the correlation and structural information between each feature using Graph CNN.

3. Background

3.1 Association of the image and point cloud

Since the point cloud acquired by the LiDAR sensor and the image acquired by the camera have different structural data characteristics, the features obtained from the camera and LiDAR are expressed in different contexts. In order to obtain the correspondence between the two, the point cloud is projected onto the camera image to obtain the projection matrix M, and the relationship between the point position and the image pixel is established. Image pixel position corresponding to a point $P=(x,y,z)$ through the predefined projection matrix $M$, i.e., the projected point $p=(x/w,y/w,1)$ can be obtained.

(1)$$p = M \times P,$$

where $P$ and $p$ respectively represent 2D and 3D vector representations in the projection space, and $M\in \mathbb {R}^{3\times 4}$ the projection matrix.

Since the point-image sampler can have a projection point $p'$ between adjacent pixels, the image feature corresponding to $p'$ is obtained after calculating continuous coordinates using the bilinear interpolation method. In this way, the representation of image features can be transformed to be similar to point features.

(2)$$F^{(p)}=B(F^{(N(p'))}),$$

where projection point $p'$ and image feature $F$ are used as inputs. $F^{(p)}$ for image feature information corresponding to each point $P$, $B$ represents the bilinear interpolation function, and $F^{N(p')}$ is the image feature of an adjacent pixel for the projection point $p'$.

3.2 Refinement network

The detection downstream module of the 3D object detection network provides the proposals predicted by the region proposal network (RPN) to the refinement network using the non-maximum suppression (NMS) procedure as in PointRCNN [26] and EPNet [6]. For each input proposal, a feature descriptor is generated by randomly selecting 512 points from the bounding box at the top of the last set abstraction (SA) layer [8] of the point and image streams in the RPN. For proposals with fewer than 512 points, the descriptor is zero-padded. The refinement network consists of two subnetworks. One is three SA layers for extracting global descriptor, and the other is two $1\times 1$ convolution layers for classification and regression.

4. Proposed method

In order to perform the perception work accurately, it is important to use the information of the camera and LiDAR sensor by fusion. In case of point cloud, which is acquired by a LiDAR sensor, semantic information is not sufficient. Therefore, we propose a 3D object detection method by combining camera and Lidar data based on Graph CNN that effectively fuses image features rich in semantic information with point features using graph feature fusion (GFF) module. In addition, this method uses the 3D object detection method as a downstream task and EPNet [6], which is a 3D object detector using fusion of image and point cloud, as a baseline for fusion of camera and LiDAR sensors. As shown in Fig. 2, the GFF module is applied to the two-stream RPN of EPNet [6] and fuses the local features of the image and the local features of the point cloud. Unlike the method that uses a fully connected (FC) layer for feature fusion in the baseline, this module uses an edge convolution layer [10] to make up geometrical relationship information between feature points. In this way, it is possible to obtain a fusion feature that combines semantic information of image features and regional geometrical relationship information between point cloud features. They are then added to the global capabilities of the point cloud, ultimately helping to improve recognition operations.

Fig. 2. Two-stream region proposal network (RPN) using proposed graph feature fusion (GFF) modules. By using multiple GFF modules, multi-scale image and point cloud features are fused to improve semantic context.

Download Full Size | PDF

For convenience of explanation, the output of set abstraction (SA) and feature propagation (FP) layers are expressed as $\text {SA}_i$ and $\text {FP}_i$, for $(i=1,2,3,4)$ respectively, and the features fused through the GFF module are expressed as $F^F$.

4.1 Point stream in RPN

In Fig. 2, the upper block represents the point stream which takes in a LiDAR point cloud as input. It then learns features for each individual point and produces 3D proposals.

The point stream uses PointNet++, which was proposed in [8], for feature extraction, and consists of four pairs of SA and FP layers. To obtain strong features among multiple scales while preserving semantic information of image, the output of the last FP layer, denoted as $F^P_8$, becomes the input of the last GFF module together with $F^I_\text {total}$ to produce the final fused feature, denoted as $F^F_\text {total}$. The details of the image feature $F^I$ will be given in the following subsection. We then deliver it to the detection head for foreground point segmentation and 3D proposal generation.

4.2 Image stream in RPN

The image stream, which is shown as the lower block in Fig. 2, receives a camera image as input and extracts image features through a CNN backbone. Like EPNet [6], we used an architecture consisting of four convolution blocks to extract image features. Each convolution block consists of two $3\times 3$ convolution layers, batch normalization [19], and ReLU activation functions with the second convolution layer of stride 2 as shown in the bottom of Fig. 2. In addition, to obtain multi-scale features, we use four parallel transposed convolution layers with different strides to create a feature map of the same size as the original image. Then, we concatenate them and obtain a multi-scale image feature $F^I_\text {total}$ containing semantic image information with different receptive fields. In Fig. 2, $F^I_i$, for $i=1,2,3,4$ represents the output of four convolution blocks, and $F^I_\text {total}$ represents the multi-scale image feature.

4.3 Graph feature fusion (GFF) module

As shown in Fig. 3, the image and point features are fused through the GFF module that takes point-wise image feature and point feature as inputs.

Fig. 3. The $i$-th GFF Module. The GFF module effectively fuses point features and the associated point-wise image features.

Download Full Size | PDF

The $i$-th GFF module, using two Graph CNNs, extracts the $i$-th fusion feature $F^F_i$ that complements the geometry structure relationship between the point and image features. As shown in Fig. 3, the point cloud feature $F^P_i$ and the point-wise image feature $F^{Ip}_i$ go through the Fusion Graph CNN Block, and mapped to the same channel.

(3)$$F^F_i = D(F^P_i) || D(F^{Ip}_i),$$

where $D$ represents the fusion Graph CNN block and $||$ the concatenation operation. $F^P$ is a LiDAR feature of shape ${N \times C}$. $F^{Ip}$ is a point-wise image feature of an adjacent pixel for the projection point $p'$ obtained through (1) and (2) and $F^{Ip}$ of shape ${ N \times C'}$.

As shown in Fig. 4, the fusion Graph CNN Block uses four edge convolution layers [10] to maintain geometric structures. In all layers, we used $k = 20$ for $k$-nn, and skip connections were used to extract multi-scale features. In addition, for all layers, the leaky ReLU activate function and batch normalization were used with the same settings as in [10]. By using edge convolution in the Graph CNN Block, which extracts non-local features, it is possible to fuse in consideration of the geometrical relationship between points without dealing with feature points independently. Additionally, the Fusion Graph CNN can generate equal dimensions for both $F^P$ and $F^{IP}$, even if they have different original dimensions. The resulting feature vectors $D(F^P)$ and $D(F^{IP})$ both have the same shape of $N \times C''$.

Fig. 4. Fusion Graph CNN Block, which uses four edge convolution layers to maintain geometric structures.

Download Full Size | PDF

While prior methods such as EPNet [6], PointNet [7], and PointRCNN [26] treat individual points independently, Fusion Graph CNNs take a different approach by creating local neighborhood graphs based on the feature edges adjacent to each point. This enables the model to incorporate local neighborhood information in a way that can be iteratively learned to capture overall shape properties. To preserve permutation invariance while constructing the local-neighbor graphs, edge functions are created to describe relationships between points and their neighbors. Convolutional-like operations are then applied to the edges that connect neighboring points, ensuring that geometric information is preserved. By doing so, the model reduces the loss of geometric information that can occur when fusing different features, leading to more efficient convergence between image and point cloud data.

In our case, the combined features are denoted as $F_f$, which is compressed into a single channel through another fully connected layer $\text {FC}(\cdot )$. The weight map $w$ is then normalized to $[0,1]$ using the sigmoid activate function $\sigma (\cdot )$.

(4)$$w=\sigma(\text{FC}(F^F)) .$$

After obtaining the weight map $w$, the point feature $F^P$ and the point-wise image feature $F^{Ip}$ are combined by concatenation. Through this, we obtain a fused feature $F^F$, which is supplemented with image semantic information and geometric structure information between feature points.

(5)$$F^F = F^P || w F^{Ip} .$$

As shown in Fig. 2, semantic information is complemented and geometric relationship information is improved by adding a fused feature $F^F_i$, where image and point features are fused in $\text {SA}_i$.

4.4 Loss function

We used the multi-task loss function proposed in PointRCNN [26] to train the network. We also used the consistency enhancement (CE) loss proposed in EPNet [6], which guarantees consistency between the classification and localization confidences. The total loss function of the proposed method is the sum of the point-image stream region proposal network (RPN) loss and the refinement network loss as defined in (6), and the loss at each stage is the sum of the classification, regression, and CE losses as defined in (7) and (8). The total loss is defined as

(6)$$\mathcal{L}_\text{total}=\mathcal{L}_\text{RPN} + \mathcal{L}_\text{RCNN} .$$

The RPN loss is defined as

(7)$$\mathcal{L}_\text{RPN} = \mathcal{L}^\text{cls}_\text{RPN} + \mathcal{L}^\text{reg}_\text{RPN} + \lambda \mathcal{L}^\text{CE}_\text{RPN},$$

and the refinement loss is defined as

(8)$$\mathcal{L}_\text{RCNN} = \mathcal{L}^\text{cls}_\text{RCNN} + \mathcal{L}^\text{reg}_\text{RCNN} + \lambda \mathcal{L}^\text{CE}_\text{RCNN},$$

where $\mathcal {L}^\text {cls}$, $\mathcal {L}^\text {reg}$, and $\mathcal {L}^\text {CE}$ respectively represent the classification, regression, and CE losses. $\lambda$ is a balance coefficient, and we used $\lambda = 5.0$ as in RENet [6].

5. Experimental results

5.1 Dataset and evaluation metrics

We evaluated the proposed method on KITTI 3D and BEV object detection benchmarks. The former consists of 7,481 training samples and 7,518 test samples. The training data consists of 3,712 train data and 3,769 validation data. Object detection results on the Kitty validation and test sets are evaluated using average precision (AP) calculated at 40 recall locations.

In addition, the KITTI dataset evaluates 3D object detection performance using the PASCAL criterion, and the difficulty is defined as Easy, Moderate, and Hard according to the size, occlusion, and cutting of the object.

5.2 Implementation details

In this paper, EPNet [6] is used as the reference model. The resolution of the image is $1280 \times 384$, and the range of LiDAR point clouds is camera coordinates X-axis (right) [-40,40], Y-axis (bottom) [-1,3], and Z-axis (front) [0,70.4] all in meters. The direction range of $\Theta$ is $[-\pi,\pi ]$.

The raw point cloud is used by sampling 16,384 points from the input LiDAR point cloud in the same way as PointRCNN [26]. If the number of points is less than 16,384, we randomly replicated sampled point to make up the difference. In addition, we used four set abstraction (SA) layers to sample points of size 4,096, 1,024, 256, and 64, and four feature propagation layers to recover the size of the point cloud for proposal generation and segmentation.

Camera images were is used in the same configuration as EPNet [6]. Specifically, we used four convolution blocks to downsample the input image, and each second layer of the block has the stride 2. In addition, four parallel transposed convolutions with strides of 2, 4, 8, and 16 were used to recover the resolution from feature maps of different scales. Finally, the regression loss [26] and the CE loss [6] in the training phase are only applied to positive proposals sharing an IoU greater than the ground truth or greater than 0.55 in the RCNN phase.

5.3 Experiments on KITTI dataset

We used the KITTI Dataset 3D object detection benchmark [27] to evaluate the efficiency of the proposed GFF module. Table 1 shows the evaluation results for the KITTI Dataset. We compare the proposed method with various 3D object detection methods using a fusion of Camera and LiDAR.

Table 1. Comparisons with state-of-the-art methods on the KITTI dataset (Car)

View Table | View all tables in this article

Our method achieved 5.33%p, 4.09%p, 0.9%p, and 0.5%p better performance than PI-RCNN [12], SegVoxelNet [18], 3D-CVF [19], and EPNet [6] in terms of 3D mAP, respectively. Our method outperformed 3D-CVF [19] for the case of Easy and Hard levels, whereas 3D-CVF [19] showed the best performance for the case of Moderate level. In addition, our method shows robust performance in 3D detection because it focuses on the fusion of point cloud. On the other hand, it can be seen that the bird’s eye view detection shows lower performance than other detectors.

Figure 5 is a visualization of the results for the KITTI data set. We can check if it detects distant objects (red boxes) and objects in complex situations (green boxes).

Fig. 5. Qualitative comparison of our method on the KITTI test split

Download Full Size | PDF

As shown in Figs. 6, 7, and 8, the proposed method can detect distant objects and objects of complex situations with high accuracy by supplementing the relationship information and geometric structure information between the image and the point cloud. We compare the results of the proposed method with the baseline method proposed in [6]. As shown in Fig. 9, the proposed method is demonstrated to be reliable in detecting small objects at a distance by combining the geometrical information from the graph CNN method with point cloud and image features.

Fig. 6. Visual comparison of the GFF module (bottom) with EPNet (top)

Download Full Size | PDF

Fig. 7. Visual comparison of the GFF Module (bottom) with EPNet (top) on the KITTI test split: Long Distance Object Detection

Download Full Size | PDF

Fig. 8. Visual comparison of the GFF Module (bottom) with EPNet (top) on the KITTI test split: Complex Scene Object Detection

Download Full Size | PDF

Fig. 9. Visual Comparison of GFF Module (Bottom) and EPNet (Top) in KITTI Test Split: Detection of Small and Distant Objects

Download Full Size | PDF

5.4 Ablation study

5.4 Effect of the number of edge convolutional layers

Table 2 shows the effect of the number of edge convolution layers [10] on the convergence performance of image and point cloud. The proposed method performed best when using four edge convolution layers. An appropriate number of edge convolution layers can be used to extract relational information feature between each sensor and ultimately improves fusion performance.

Table 2. Performance results according to the number of edge convolution layers on the KITTI val dataset

View Table | View all tables in this article

5.4 Visual comparison

Table 3 and Fig. 10 show the effect of the structure of the GFF module on the image and point cloud fusion performance. Element-wise addition of two different features can reflect some characteristics of the original features, but information loss occurs in the process. On the other hand, concatenation preserves the original position of features by directly connecting two different features, and the network learns to fuse the features. Unlike element-wise addition, no information is lost in this process. Therefore, in the GFF module, concatenation is used to effectively converge point-wise image and point features.

Fig. 10. Two different implementation of the GFF module: (a) Proposed concatenation method and (b) point-wise addition with tanh activation.

Download Full Size | PDF

Table 3. Performance results according to GFF modeling (a), (b) on the KITTI val dataset

View Table | View all tables in this article

5.4 Relationship between performance and GFLOPs

We investigated the relationship between performance and GPU floating-point operations (GFLOPs) by modifying the RPN network. We compared our proposed two-stage RCNN structure with two other 3D object detection methods: i) PointRCNN [26], which relies solely on LiDAR data, and ii) EPNet [6], which fusion camera and LiDAR data. We plotted the GFLOPs against the performance improvement for each method, as shown in Fig. 11.

Fig. 11. Relationship between mean average precision (mAP) and floating-point operations per second (GFLOPs) in the RPN network.

Download Full Size | PDF

Based on this experiment, we observed that performance enhancement increases GFLOPs. The fusion methods, proposed method and EPNet, showed significant differences in both GFLOPs and performance due to the addition of LiDAR and camera feature fusion modules. In addition, our proposed method further utilizes a graph network to fuse camera and LiDAR features while preserving the geometry, which leads to an increase in GFLOPs. However, our method demonstrates more robust performance compared to the other two methods, indicating the effectiveness of the proposed approach.

6. Conclusion

In this paper, we propose a robust, accurate 3D object detection method using fusion of camera and LiDAR data based on graph CNN. Using the proposed GFF module, it is possible to create a fusion feature that adds relational information between image and point cloud features using Graph CNN instead of the MLP and FC layers used to converge different features in existing sensor fusion methods. The performance of the GFF module was verified through an experiment on the KITTI dataset. However, there is a disadvantage of using only the local information of each sensor when fusing features. To solve this problem, we plan to study the field of sensor fusion using a transformer in the future, and based on this, we will propose a fusion method that considers not only local features but also global features.

Funding

Ministry of Science and ICT, South Korea (2020M3F6A111350); Korea Institute for Information and Communications Technology Planning and Evaluation (2021-0-01341, Artificial Intelligence Graduate School Program, Chung-Ang University).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. W. Shi and R. Rajkumar, “Point-GNN: graph neural network for 3D object detection in a point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 1711–1719.

2. Y. Zhang, X. Gao, Z. Chen, and H. Zhong, “Learning correlation filter with detection response for visual tracking,” in 2019 IEEE International Conference on Image Processing (ICIP), (IEEE, 2019), pp. 3990–3994.

3. Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, and J. Sun, “You only look one-level feature,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021), pp. 13039–13048.

4. C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “TOOD: Task-aligned one-stage object detection,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (IEEE Computer Society, 2021), pp. 3490–3499.

5. A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker, A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6D pose estimation in the Amazon picking challenge,” in 2017 IEEE international conference on robotics and automation (ICRA), (IEEE, 2017), pp. 1383–1386.

6. T. Huang, Z. Liu, X. Chen, and X. Bai, “EPNet: Enhancing point features with image semantics for 3D object detection,” in European Conference on Computer Vision, (Springer, 2020), pp. 35–52.

7. C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: deep learning on point sets for 3D classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 652–660.

8. C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems 30 (2017).

9. M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 3693–3702.

10. A. V. Phan, M. Le Nguyen, Y. L. H. Nguyen, and L. T. Bui, “DGCNN: a convolutional neural network over large-scale labeled graphs,” Neural Networks 108, 533–543 (2018). [CrossRef]

11. M. Liang, B. Yang, S. Wang, and R. Urtasun, “Deep continuous fusion for multi-sensor 3D object detection,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 641–656.

12. L. Xie, C. Xiang, Z. Yu, G. Xu, Z. Yang, D. Cai, and X. He, “PI-RCNN: an efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34 (2020), pp. 12460–12467.

13. M. Simon, K. Amende, A. Kraus, J. Honer, T. Samann, H. Kaulbersch, S. Milz, and H. Michael Gross, “Complexer-YOLO: real-time 3D object detection and tracking on semantic point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2019), p. 0.

14. J. Dou, J. Xue, and J. Fang, “SEG-VoxelNet for 3D vehicle detection from RGB and LiDAR data,” in 2019 International Conference on Robotics and Automation (ICRA), (IEEE, 2019), pp. 4362–4368.

15. G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez, “Sensor fusion for joint 3D object detection and semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2019), p. 0.

16. S. Pang, D. Morris, and H. Radha, “CLOCS: camera-lidar object candidates fusion for 3D object detection,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE, 2020), pp. 10386–10393.

17. A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. J. Nunes, “Multimodal vehicle detection: fusing 3D-lidar and color camera data,” Pattern Recognit. Lett. 115, 20–29 (2018). [CrossRef]

18. H. Yi, S. Shi, M. Ding, J. Sun, K. Xu, H. Zhou, Z. Wang, S. Li, and G. Wang, “SegVoxelNet: exploring semantic context and depth-aware features for 3D vehicle detection from point cloud,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), (IEEE, 2020), pp. 2274–2280.

19. J. H. Yoo, Y. Kim, J. Kim, and J. W. Choi, “3D-CVF: generating joint camera and lidar features using cross-view spatial feature fusion for 3D object detection,” in European Conference on Computer Vision, (Springer, 2020), pp. 720–736.

20. X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, (2017), pp. 1907–1915.

21. J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander, “Joint 3D proposal generation and object detection from view aggregation,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE, 2018), pp. 1–8.

22. V. A. Sindagi, Y. Zhou, and O. Tuzel, “MVX-Net: multimodal VoxelNet for 3D object detection,” in 2019 International Conference on Robotics and Automation (ICRA), (IEEE, 2019), pp. 7276–7282.

23. S. Vora, A. H. Lang, B. Helou, and O. Beijbom, “PointPainting: sequential fusion for 3D object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 4604–4612.

24. C. R. Qi, X. Chen, O. Litany, and L. J. Guibas, “ImVoteNet: boosting 3D object detection in point clouds with image votes,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 4404–4413.

25. K. Huang and Q. Hao, “Joint multi-object detection and tracking with camera-lidar fusion for autonomous driving,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE, 2021), pp. 6983–6989.

26. S. Shi, X. Wang, and H. Li, “PointRCNN: 3D object proposal generation and detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 770–779.

27. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the KITTI vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition, (IEEE, 2012), pp. 3354–3361.

28. C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum PointNets for 3D object detection from RGB-D data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 918–927.

29. X. Du, M. H. Ang, S. Karaman, and D. Rus, “A general pipeline for 3D detection of vehicles,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), (IEEE, 2018), pp. 3194–3200.

30. Y. He, G. Xia, Y. Luo, L. Su, Z. Zhang, W. Li, and P. Wang, “DVFENet: dual-branch voxel feature extraction network for 3D object detection,” Neurocomputing 459, 201–211 (2021). [CrossRef]

31. L. Wang, C. Wang, X. Zhang, T. Lan, and J. Li, “S-AT GCN: spatial-attention graph convolution network based feature enhancement for 3D object detection,” arXiv, arXiv:2103.08439 (2021). [CrossRef]

32. L. Zhao, M. Wang, and Y. Yue, “Sem-Aug: improving camera-lidar feature fusion with semantic augmentation for 3D vehicle detection,” IEEE Robotics Autom. Lett. 7(4), 9358–9365 (2022). [CrossRef]

33. E. Yurtsever, E. Erçelik, M. Liu, Z. Yang, H. Zhang, P. Topçam, M. Listl, Y. K. Çaylı, and A. Knoll, “3D object detection with a self-supervised lidar scene flow backbone,” arXiv, arXiv-2205 (2022). [CrossRef]

	Car (3D Detection)				Car (Bird’s Eye View)
Benchmark	Easy	Moderate	Hard	mAP	Easy	Moderate	Hard	mAP
F-Pointnet [28]	82.19	69.79	60.59	70.86	91.17	84.67	74.77	83.54
MV3D [20]	74.97	63.63	54.00	64.20	86.62	78.93	69.80	78.45
AVOD [21]	76.39	66.47	60.23	67.70	89.75	84.95	78.32	84.34
AVOD-FPN [21]	83.07	71.76	65.73	73.52	90.99	84.82	79.62	85.14
ContFuse [11]	83.68	68.78	61.67	71.38	94.07	85.35	75.88	85.10
PC-CNN [29]	85.57	73.79	65.65	75.00	91.19	87.40	79.35	85.98
PI-RCNN [12]	84.37	74.82	70.03	76.41	91.55	85.81	81.00	86.12
SegVoxelNet [18]	86.04	76.13	70.76	77.64	91.62	86.37	83.04	87.01
3D-CVF [19]	89.20	80.05	73.11	80.79	93.52	89.56	82.45	88.51
EPNet [6]	89.81	79.28	74.59	81.23	94.22	88.47	83.69	88.79
DVFENet [30]	86.20	79.18	74.58	79.99	90.93	87.68	84.60	87.74
S-AT GCN [31]	83.20	76.04	71.17	76.80	90.85	87.68	84.20	87.58
Sem-Aug [32]	86.69	78.06	73.85	79.53	91.68	85.88	83.37	86.97
SSL-PointGNN [33]	87.78	79.36	74.15	80.43	92.92	89.16	83.99	88.69
Ours	90.52	79.71	74.98	81.74	94.64	88.57	83.90	89.03

Layer	Easy	Moderate	Hard	mAP
1	91.05	80.04	77.77	82.95
4	92.27	82.47	80.29	85.01
6	91.98	82.26	79.95	84.73

Model	Easy	Moderate	Hard	mAP
GFF (a)	92.27	82.47	80.29	85.01
GFF (b)	91.71	82.17	80.02	84.63

	Car (3D Detection)				Car (Bird’s Eye View)
Benchmark	Easy	Moderate	Hard	mAP	Easy	Moderate	Hard	mAP
F-Pointnet [28]	82.19	69.79	60.59	70.86	91.17	84.67	74.77	83.54
MV3D [20]	74.97	63.63	54.00	64.20	86.62	78.93	69.80	78.45
AVOD [21]	76.39	66.47	60.23	67.70	89.75	84.95	78.32	84.34
AVOD-FPN [21]	83.07	71.76	65.73	73.52	90.99	84.82	79.62	85.14
ContFuse [11]	83.68	68.78	61.67	71.38	94.07	85.35	75.88	85.10
PC-CNN [29]	85.57	73.79	65.65	75.00	91.19	87.40	79.35	85.98
PI-RCNN [12]	84.37	74.82	70.03	76.41	91.55	85.81	81.00	86.12
SegVoxelNet [18]	86.04	76.13	70.76	77.64	91.62	86.37	83.04	87.01
3D-CVF [19]	89.20	80.05	73.11	80.79	93.52	89.56	82.45	88.51
EPNet [6]	89.81	79.28	74.59	81.23	94.22	88.47	83.69	88.79
DVFENet [30]	86.20	79.18	74.58	79.99	90.93	87.68	84.60	87.74
S-AT GCN [31]	83.20	76.04	71.17	76.80	90.85	87.68	84.20	87.58
Sem-Aug [32]	86.69	78.06	73.85	79.53	91.68	85.88	83.37	86.97
SSL-PointGNN [33]	87.78	79.36	74.15	80.43	92.92	89.16	83.99	88.69
Ours	90.52	79.71	74.98	81.74	94.64	88.57	83.90	89.03

Layer	Easy	Moderate	Hard	mAP
1	91.05	80.04	77.77	82.95
4	92.27	82.47	80.29	85.01
6	91.98	82.26	79.95	84.73

Fusion of an RGB camera and LiDAR sensor through a Graph CNN for 3D object detection

Abstract

1. Introduction

2. Related works

2.1 Deep learning methods for the 3D point cloud

2.1 Point-based methods

2.1 Graph-based methods

2.2 Fusion of the image and point cloud

2.2 Early fusion

2.2 Late fusion

2.2 Deep fusion

2.3 3D object detection using fusion of the image and point cloud

3. Background

3.1 Association of the image and point cloud

3.2 Refinement network

4. Proposed method

4.1 Point stream in RPN

4.2 Image stream in RPN

4.3 Graph feature fusion (GFF) module

4.4 Loss function

5. Experimental results

5.1 Dataset and evaluation metrics

5.2 Implementation details

5.3 Experiments on KITTI dataset

5.4 Ablation study

5.4 Effect of the number of edge convolutional layers

5.4 Visual comparison

5.4 Relationship between performance and GFLOPs

6. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (11)

Tables (3)

Equations (8)

Optics Continuum