Paper review: BEVFusion (ICRA 2023)


BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

 

Motivation

Recent approaches are largely based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features. In this paper, the authors propose BEVFusion, which unifies multi-modal features in a shared BEV representation space.

 

Main Idea

Cameras capture rich semantic information, LiDARs provide accurate spatial information, and radars offer instant velocity estimation. Data from different sensors are thus expressed in fundamentally different modalities. To resolve this view discrepancy, the authors propose BEVFusion, which unifies multi-modal features in a shared BEV representation space for task-agnostic learning.

[Figure-1] Architecture of BEVFusion

Previous Methods

LiDAR-Based 3D Perception

Researchers have designed single-stage 3D object detectors that extract flattened point cloud features using PointNets or SparseConvNet and perform detection in the BEV space.

Camera-Based 3D perception

Due to the high cost of LiDAR sensors, researchers have spent significant effort on camera-only 3D perception. FCOS3D extends a 2D detector with additional 3D regression branches. Instead of performing object detection in the perspective view, other methods design a DETR-based detection head with learnable object queries in the 3D space. Yet another line of 3D perception models explicitly converts camera features from the perspective view to BEV using a view transformer.

 

Multi-sensor Fusion

Proposal-level

FUTR3D and TransFusion define object queries in the 3D space and fuse image features onto these proposals. All proposal-level fusion methods are object-centric and cannot trivially generalize to other tasks such as BEV map segmentation.

 

Point-level

LiDAR-to-camera fusion methods project the LiDAR point cloud onto the camera plane and render a 2.5D sparse depth map. However, this conversion is geometrically lossy: two neighbors on the depth map can be far away from each other in 3D space.

Camera-to-LiDAR fusion methods decorate LiDAR points with their corresponding camera features. However, this camera-to-LiDAR projection is semantically lossy: the two modalities have drastically different densities, so fewer than 5% of camera features end up matched to a LiDAR point.

Both families of methods are object-centric and geometry-centric. Because of this semantic loss, point-level fusion methods barely work on semantic-oriented tasks such as BEV map segmentation.

BEVFusion Method

[Figure-2] Differences between BEVFusion and other fusion methods

The BEV representation is friendly to almost all perception tasks. More importantly, the transformation to BEV keeps both geometric structure and semantic density: LiDAR-to-BEV projection flattens the sparse LiDAR features along the height dimension and thus creates no geometric distortion (Figure-2(a)), while camera-to-BEV projection casts each camera feature pixel back into a ray in 3D space.

Efficient Camera-to-BEV Transformation

[Figure-3] Camera-to-BEV transformation

Following LSS, they explicitly predict a discrete depth distribution for each pixel. They then scatter each feature pixel into $D$ discrete points along the camera ray and rescale the associated features by their corresponding depth probabilities. This generates a camera feature point cloud of size $NHWD$, where $N$ is the number of cameras and $(H, W)$ is the camera feature map size. The 3D feature point cloud is quantized along the $x$, $y$ axes with a step size of $r$. They then use the BEV pooling operation to aggregate all features within each $r \times r$ BEV grid cell and flatten the features along the $z$-axis.
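
To make the lifting step concrete, here is a minimal PyTorch sketch of the LSS-style view transform described above. The tensor shapes are illustrative, and the point-to-BEV mapping is faked with random coordinates (in the real model it comes from the camera intrinsics/extrinsics); this is not the authors' implementation.

```python
import torch

N, C, H, W, D = 6, 32, 16, 44, 60            # cameras, channels, feature map size, depth bins

feats = torch.randn(N, C, H, W)                      # per-camera image features
depth_prob = torch.randn(N, D, H, W).softmax(dim=1)  # discrete depth distribution per pixel

# Outer product: rescale each feature pixel by its depth probabilities,
# producing a camera feature point cloud with N*H*W*D points.
frustum = feats.unsqueeze(2) * depth_prob.unsqueeze(1)      # (N, C, D, H, W)
points = frustum.permute(0, 2, 3, 4, 1).reshape(-1, C)      # (N*D*H*W, C)

# Each frustum point has an (x, y) location in the ego frame, normally derived
# from the camera calibration; here it is faked for illustration.
xy = torch.rand(points.shape[0], 2) * 102.4 - 51.2          # metres
r = 0.4                                                     # BEV grid step
grid = ((xy + 51.2) / r).long()                             # integer BEV cell per point
nx = round(102.4 / r)                                       # cells per axis
flat_idx = grid[:, 0] * nx + grid[:, 1]

# Naive BEV pooling: sum-aggregate all features falling into the same r x r cell
# (the flatten along z is implicit, since height is never used as a key).
bev = torch.zeros(nx * nx, C).index_add_(0, flat_idx, points).view(nx, nx, C)
print(bev.shape)
```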

Though simple, BEV pooling is surprisingly inefficient and slow, taking more than 500ms on an RTX 3090 GPU (while the rest of the model only takes around 100ms). This is because the camera feature point cloud is very large. To lift this efficiency bottleneck, they propose to optimize BEV pooling with precomputation and interval reduction.

 

Precomputation. The first step of BEV pooling is to associate each point in the camera feature point cloud with a BEV grid cell. Unlike LiDAR point clouds, the coordinates of the camera feature point cloud are fixed, because the camera intrinsics and extrinsics stay the same. So they precompute the 3D coordinate and the BEV grid index of each point. They also sort all points according to grid indices and record the rank of each point. During inference, they only need to reorder all feature points based on the precomputed ranks. This caching mechanism reduces the latency of grid association from 17ms to 4ms.
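
A rough sketch of the precomputation idea, assuming fixed camera calibration: the grid index of every frustum point is computed once offline, the points are sorted by it, and inference only applies the cached permutation. The function name `precompute_ranks` is mine, not from the official codebase.

```python
import torch

def precompute_ranks(flat_grid_idx: torch.Tensor):
    """Sort points by BEV grid index once, offline.
    Returns the sorting permutation ("ranks"), the start position of each
    occupied cell's contiguous interval, and the cell id of each interval."""
    sorted_idx, ranks = flat_grid_idx.sort()
    change = torch.ones_like(sorted_idx, dtype=torch.bool)
    change[1:] = sorted_idx[1:] != sorted_idx[:-1]        # True where the cell id changes
    interval_starts = change.nonzero(as_tuple=False).squeeze(1)
    cell_ids = sorted_idx[interval_starts]
    return ranks, interval_starts, cell_ids

# Offline: camera intrinsics/extrinsics are fixed, so the grid index of every
# frustum point never changes and can be cached together with the ranks.
flat_grid_idx = torch.randint(0, 256 * 256, (100_000,))   # toy precomputed indices
ranks, interval_starts, cell_ids = precompute_ranks(flat_grid_idx)

# Online: only reorder the freshly computed camera features with the cached ranks.
feats = torch.randn(100_000, 80)
feats_sorted = feats[ranks]
```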

 

Interval Reduction. The next step of BEV pooling is to aggregate the features within each BEV grid cell with some symmetric function. The existing implementation first computes the prefix sum over all points and then subtracts the values at the boundaries where indices change. However, the prefix sum operation requires tree reduction on the GPU and produces many unused partial sums, both of which are inefficient. To accelerate feature aggregation, they implement a specialized GPU kernel that parallelizes directly over BEV grid cells: each GPU thread is assigned to one cell, calculates its interval sum, and writes the result back. This reduces the latency of feature aggregation from 500ms to 2ms.
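
Below is a CPU emulation of the interval-reduction idea, just to show the data layout: because the points are already sorted by cell (see the precomputation sketch), every occupied cell owns one contiguous interval, so a single worker can sum it directly without any prefix sums. The real version is a custom CUDA kernel; this loop only mimics its per-thread logic.

```python
import torch

def interval_reduce(feats_sorted, interval_starts, cell_ids, num_cells):
    """Sum each contiguous run of points that shares one BEV cell.
    Each loop iteration stands in for one GPU thread of the real kernel."""
    ends = torch.cat([interval_starts[1:],
                      torch.tensor([feats_sorted.shape[0]])])
    bev = torch.zeros(num_cells, feats_sorted.shape[1])
    for cell, s, e in zip(cell_ids.tolist(), interval_starts.tolist(), ends.tolist()):
        bev[cell] = feats_sorted[s:e].sum(dim=0)      # one interval sum per cell
    return bev

# Toy input: 10 points already sorted into 4 occupied cells.
feats_sorted = torch.randn(10, 4)
interval_starts = torch.tensor([0, 3, 5, 9])          # first point of each cell's interval
cell_ids = torch.tensor([0, 2, 7, 9])                 # which BEV cell each interval belongs to
bev = interval_reduce(feats_sorted, interval_starts, cell_ids, num_cells=16)
```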

 

Takeaways. The camera-to-BEV transformation is 40x faster with their optimized BEV pooling: the latency is reduced from more than 500ms to 12ms, and it scales well across different feature resolutions. Two concurrent works also identify this efficiency bottleneck in camera-only 3D detection. The technique proposed here is exact, without any approximation, while still being faster.

Fully-Convolutional Fusion

With all sensory features converted to the shared BEV representation, they can easily fuse them together with an elementwise operator (such as concatenation). Though in the same space, LiDAR BEV features and camera BEV features can still be spatially misaligned due to the inaccurate depth in the view transformer. To this end, they apply a convolution-based BEV encoder to compensate for such local misalignments.
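
A minimal PyTorch sketch of this fusion step: the two BEV maps are concatenated along the channel dimension and passed through a small convolutional encoder that can absorb local misalignment. The channel sizes and layer count are illustrative assumptions, not the paper's exact encoder.

```python
import torch
import torch.nn as nn

class ConvBEVFuser(nn.Module):
    def __init__(self, cam_ch=80, lidar_ch=256, out_ch=256):
        super().__init__()
        # A few convolutions give enough receptive field to compensate
        # for small spatial misalignments between the two BEV maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Elementwise fusion (channel concatenation) followed by conv encoding.
        return self.encoder(torch.cat([cam_bev, lidar_bev], dim=1))

fused = ConvBEVFuser()(torch.randn(1, 80, 180, 180), torch.randn(1, 256, 180, 180))
```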

 

Multi-Task Heads

Their method is applicable to most 3D perception tasks. For 3D object detection, they use a class-specific center heatmap head to predict the center location of all objects and a few regression heads to estimate the object size, rotation, and velocity. For map segmentation, different map categories may overlap, so they formulate the problem as multiple binary semantic segmentation tasks, one for each class.
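
A hedged sketch of the two task heads in PyTorch, assuming a 256-channel fused BEV map. The real heads have more layers, but the structure follows the description above: a per-class center heatmap plus regression branches for detection, and independent binary logits per map class for segmentation.

```python
import torch
import torch.nn as nn

class DetHead(nn.Module):
    """Class-specific center heatmap plus regression branches
    (size, rotation as sin/cos, velocity); channel layout is illustrative."""
    def __init__(self, in_ch=256, num_classes=10):
        super().__init__()
        self.heatmap = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        self.reg = nn.Conv2d(in_ch, 3 + 2 + 2, 3, padding=1)   # size, sin/cos, velocity

    def forward(self, bev):
        return self.heatmap(bev).sigmoid(), self.reg(bev)

class SegHead(nn.Module):
    """One binary segmentation map per (possibly overlapping) map class."""
    def __init__(self, in_ch=256, num_map_classes=6):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_map_classes, 1)

    def forward(self, bev):
        return self.cls(bev).sigmoid()   # independent per-class probabilities

bev = torch.randn(1, 256, 180, 180)
heatmaps, boxes = DetHead()(bev)
masks = SegHead()(bev)
```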

 

Experiments

They evaluate BEVFusion for camera-LiDAR fusion on 3D object detection and BEV map segmentation, covering both geometric- and semantic-oriented tasks. The framework can be easily extended to support other types of sensors and other 3D perception tasks. They use Swin-T as the image backbone and VoxelNet as the LiDAR backbone. They apply FPN to fuse multi-scale camera features and produce a feature map of 1/8 input size. They downsample camera images to 256x704 and voxelize the LiDAR point cloud with 0.075m voxels for detection and 0.1m voxels for segmentation. They apply grid sampling with bilinear interpolation before each task-specific head to explicitly transform between the different BEV feature maps.
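
For reference, the setup above can be collected into one small config; the values are taken from the text, while the key names are my own and do not mirror the official codebase.

```python
# Illustrative experiment configuration (key names are assumptions).
experiment_config = {
    "image_backbone": "Swin-T",
    "lidar_backbone": "VoxelNet",
    "image_neck": "FPN (multi-scale fusion, 1/8 input resolution)",
    "image_size_hw": (256, 704),                         # downsampled camera input
    "voxel_size_m": {"detection": 0.075, "segmentation": 0.1},
    "bev_head_alignment": "grid sampling with bilinear interpolation",
}
```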

 

3D object detection

They experiment on the geometric-centric 3D object detection benchmark. 

[Table-1] Comparison with SOTA on nuScenes
[Table-2] Comparison with SOTA on Waymo

They argue that the efficiency gain of BEVFusion comes from the fact that they choose the BEV space as the shared fusion space, which fully utilizes all camera features instead of just a 5% sparse set.

 

BEV Map Segmentation

They compare BEVFusion with SOTA methods on the semantic-centric BEV map segmentation task. For each frame, they only perform the evaluation in the [-50m, 50m] x [-50m, 50m] region around the ego car.

[Table-3] Comparison with SOTA on nuScenes

The camera-only BEVFusion model outperforms LiDAR-only baselines by 8-13%. This observation is the exact opposite of the results in Table 1. In the multi-modality setting, BEVFusion outperforms SOTA sensor fusion methods. This is because both baseline methods are object-centric and geometry-oriented, which does not help with segmenting map components.

 

Analysis

Weather and Lighting

[Table-4] Comparison with SOTA under different lighting and weather conditions

 

Sizes and Distances

[Figure-4] Comparison with SOTA under more challenging settings

 

Conclusion

BEVFusion unifies camera and LiDAR features in a shared BEV space that fully preserves geometric and semantic information. To achieve this efficiently, they accelerate the slow camera-to-BEV transformation with an optimized BEV pooling operator. BEVFusion also outperforms all existing sensor fusion methods on the Waymo open dataset.

 

References

[Figure-1~4, Table-1~4] https://arxiv.org/pdf/2205.13542
