Paper review: DSVT (CVPR 2023)


DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

Motivation

A 3D backbone that can handle sparse point clouds is a fundamental component of 3D perception. Compared with sparse convolution, the attention mechanism in Transformers is better suited to sparse data and is easier to deploy in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard Transformer to sparse points. In this paper, the authors present the Dynamic Sparse Voxel Transformer (DSVT), proposing Dynamic Sparse Window Attention and an attention-style 3D pooling module that operate on sparse points without customized CUDA operations.

 

Main Idea

[Figure-1] Detection performance of different methods on the Waymo validation set.

Unlike the well-studied 2D community, where input images are densely packed arrays, 3D point clouds are sparsely and irregularly distributed in continuous space due to the nature of 3D sensors.

 

Overview

DSVT first converts the input point clouds into sparse voxels with a voxel feature encoding (VFE) module, as in VoxelNet. The voxels are then processed by the proposed DSVT blocks and 3D pooling modules, together with window-shifting and window-size strategies adapted from Swin Transformer.

[Figure-2] Architecture of DSVT

Dynamic Sparse Window Attention

The number of non-empty voxels in each window may vary significantly, which makes directly applying a standard Transformer non-trivial. Previous methods rely on padding, sampling, or grouping, but these suffer from redundant computation or unstable performance. The authors therefore propose Dynamic Sparse Window Attention, a window-based attention strategy for efficiently handling sparse 3D voxels in parallel.

 

After converting points to 3D voxels, they are further partitioned into a list of non-overlapping 3D windows of fixed size $L \times W \times H$, like previous window-based approaches.
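For intuition, here is a minimal sketch of this window assignment in PyTorch; `assign_windows` is a hypothetical helper and its window size is an arbitrary example, not the paper's setting.

```python
import torch

def assign_windows(coords, window_size=(12, 12, 4)):
    """Map non-negative integer voxel coordinates (N, 3) to the scalar ID of
    the fixed-size, non-overlapping L x W x H window containing each voxel."""
    ws = coords.new_tensor(window_size)                 # (3,)
    win = torch.div(coords, ws, rounding_mode='floor')  # (N, 3) window coords
    extent = win.max(dim=0).values + 1                  # number of windows per axis
    # Flatten the 3D window coordinates into one scalar ID per voxel.
    return (win[:, 0] * extent[1] + win[:, 1]) * extent[2] + win[:, 2]
```

Voxels sharing a window ID are then grouped and handled by the set partition described next.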

[Eq-1] Non-empty voxels

For a specific window with $N$ non-empty voxels, $(x_i, y_i, z_i) \in \mathbb{R}^3$ and $f_i \in \mathbb{R}^C$ denote the coordinates and features of the sparse voxels, respectively, and $d_i \in [1, N]$ is the corresponding inner-window voxel ID, generated by a sorting strategy over these voxels. To generate non-overlapping, size-equivalent local sets, they first compute the required number of subsets in the window.

[Eq-2] Number of voxel sub-sets in each window

$\tau$ is a hyperparameter indicating the maximum number of non-empty voxels allocated to each set. Notably, $S$ varies dynamically with the sparsity of the window. Given the number of assigned sets $S$, they evenly distribute the $N$ non-empty voxels into the $S$ sets; the voxel indices belonging to the $j$-th set, denoted $\mathcal{Q}_j = \{q_k^j\}_{k=0}^{\tau-1}$, are computed as follows.

[Eq-3] indices of voxels

This operation generates a fixed number of voxels for each set regardless of $N$, which enables parallel computation; the redundant (duplicated) voxels are simply masked out. After obtaining the partition $\mathcal{Q}_j$ of the $j$-th set, they collect the corresponding voxel features and coordinates based on the inner-window voxel IDs $D = \{d_i\}_{i=1}^N$.

[Eq-4] Index operation.

where $\mathrm{INDEX}(\text{voxels}, \text{partition}, \text{ID})$ denotes the index operation, and $F_j \in \mathbb{R}^{\tau \times C}$ and $\mathcal{O}_j \in \mathbb{R}^{\tau \times 3}$ are the corresponding voxel features and spatial coordinates $(x, y, z)$ of this set. However, computing self-attention inside an invariant partition lacks connections across the subsets, limiting the model's representational power. To bridge the voxels among the non-overlapping sets, they propose a rotated-set attention approach that alternates between two partitioning configurations in consecutive attention layers.

[Eq-5] Rotated set attention

where $D_x$ and $D_y$ are the inner-window voxel indices sorted along the X-axis and Y-axis, respectively, and $F \in \mathbb{R}^{S \times \tau \times C}$ and $\mathcal{O} \in \mathbb{R}^{S \times \tau \times 3}$ are the corresponding indexed voxel features and coordinates of all sets.
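Putting Eq-2 through Eq-5 together, a minimal single-window sketch may help. The even-distribution index arithmetic, toy sizes, and shared attention layer are assumptions for illustration rather than the authors' implementation, and the paper's positional encodings are omitted:

```python
import torch
import torch.nn as nn

def dynamic_set_partition(feats, coords, tau=36, axis=0):
    """Partition one window's N non-empty voxels into S size-tau sets
    (a sketch of Eq-2~4; the exact index arithmetic is an assumption).
    feats: (N, C) voxel features, coords: (N, 3) integer voxel coordinates."""
    N = feats.shape[0]
    S = (N + tau - 1) // tau                    # Eq-2: S = ceil(N / tau)
    # Inner-window IDs d_i: rank of each voxel when sorted along `axis`
    # (X-axis gives D_x, Y-axis gives D_y in the rotated-set scheme).
    order = torch.argsort(coords[:, axis])
    # Eq-3 (assumed form): spread the N sorted voxels evenly over S*tau slots.
    slots = torch.arange(S * tau)
    idx = torch.div(slots * N, S * tau, rounding_mode='floor')  # values in [0, N)
    # Keep only the first slot of each repeated voxel; duplicates get masked.
    valid = torch.ones(S * tau, dtype=torch.bool)
    valid[1:] = idx[1:] != idx[:-1]
    src = order[idx].view(S, tau)               # source voxel index per slot
    # Eq-4: the INDEX operation -- gather per-set features and coordinates.
    return feats[src], coords[src], valid.view(S, tau), src

# Rotated-set attention (Eq-5): consecutive layers alternate the sorting axis,
# so information flows across both the X-sorted and the Y-sorted sets.
torch.manual_seed(0)
N, C, tau = 100, 64, 36
feats = torch.randn(N, C)                       # voxel features of one window
coords = torch.randint(0, 12, (N, 3))           # toy integer voxel coordinates
mhsa = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

for axis in (0, 1):                             # layer 1: D_x sets, layer 2: D_y sets
    F, O, valid, src = dynamic_set_partition(feats, coords, tau=tau, axis=axis)
    out, _ = mhsa(F, F, F, key_padding_mask=~valid)  # intra-set self-attention
    feats = feats.clone()
    feats[src[valid]] = out[valid]              # scatter attended features back
```

Because $S \times \tau \ge N$, every voxel is covered exactly once among the unmasked slots, so the scatter step writes each attended feature back exactly once.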

 

Specifically, they follow Swin Transformer in applying a window-shifting technique between two consecutive DSVT blocks to re-partition the sparse windows, though with a different window size.
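Under the common Swin-style interpretation that shifting the partition grid equals offsetting coordinates by half a window, the re-partition can reuse the hypothetical `assign_windows` sketch from above (the shift size is illustrative):

```python
import torch

coords = torch.randint(0, 48, (1000, 3))      # toy voxel coordinates
shift = torch.tensor([6, 6, 2])               # half of the (12, 12, 4) example
ids_regular = assign_windows(coords)          # original window partition
ids_shifted = assign_windows(coords + shift)  # re-groups voxels across borders
```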

Attention-style 3D Pooling

They observe that simply padding the sparse regions and applying an MLP for downsampling degrades performance, and that inserting an MLP between Transformer blocks also harms network optimization. To support effective 3D downsampling and better encode spatial information, they present an attention-style 3D pooling operation.

 

Given a local pooling region of size $l \times w \times h$ with non-empty voxels $\{p_i\}_{i=1}^P$, they first pad the sparse region into a dense one, $\{\tilde{p}_i\}_{i=1}^{l \times w \times h}$, and perform standard max-pooling along the voxel dimension.

[Eq-6] Pooled feature $\mathcal{P}$

Instead of directly feeding the pooled feature forward, they use the pooled feature $\mathcal{P}$ to construct the query vector, while the original unpooled voxels $\{p_i\}_{i=1}^P$ serve as the key and value vectors.

[Eq-7] Attention-style 3D pooling
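A minimal sketch of this two-step pooling for a single local region, assuming a standard multi-head attention layer (the sizes are placeholders, and we max-pool directly over the non-empty voxels instead of first padding the region dense):

```python
import torch
import torch.nn as nn

C = 64
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

def attention_style_pool(region_feats):
    """Sketch of attention-style pooling for one l x w x h local region.
    region_feats: (P, C) features of the region's non-empty voxels."""
    # Eq-6: max-pooling over the region builds the pooled feature (the query).
    query = region_feats.max(dim=0, keepdim=True).values  # (1, C)
    # Eq-7: the pooled query attends to the unpooled voxels (keys and values).
    out, _ = attn(query.unsqueeze(0),                     # (1, 1, C)
                  region_feats.unsqueeze(0),              # (1, P, C)
                  region_feats.unsqueeze(0))
    return out[0, 0]                                      # pooled vector, (C,)

pooled = attention_style_pool(torch.randn(9, C))          # e.g. P = 9 voxels
```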

Experiments

Evaluation

Waymo Dataset

[Table-1] The results of 3D object detection

NuScenes Dataset

[Table-2] The results of 3D object detection and BEV map segmentation

 

Ablation Study

[Table-3] Comparison with sparse convolution

 

[Table-4] Comparison with standard self-attention
[Table-5] Effect of set partition method
[Table-6] Effect of hybrid window partition
[Table-7] Effect of 3D sparse pooling module
[Table-8] The latency and performance on Waymo validation set

Conclusion

In this paper, the authors propose DSVT, a deployment-friendly yet powerful Transformer-only backbone for 3D perception. To efficiently handle sparse point clouds, they introduce dynamic sparse window attention and an attention-style 3D pooling module, neither of which requires customized CUDA operations. The proposed DSVT can therefore be accelerated by the well-optimized NVIDIA TensorRT, and it achieves state-of-the-art performance on various 3D perception benchmarks at real-time running speed.

 

Reference

[Figure-1~2, Eq-1~7, Table-1~8]: https://arxiv.org/pdf/2301.06051
