2024. 7. 25. 14:37ㆍReview/- Network
DCNv1: Deformable Convolutional Networks
Motivation
A key challenge in visual recognition is how to accommodate geometric variations or model geometric transformations. In general, there are two approaches. The first is to build training datasets with sufficient variation by augmenting existing samples, for example with affine transformations. The second is to use transformation-invariant features and algorithms, such as SIFT and the sliding-window paradigm. Both have drawbacks. First, the geometric transformations are assumed to be fixed and known, and this prior knowledge is used to augment the data and to design the features and algorithms. Second, hand-crafted design of invariant features and algorithms can be difficult or infeasible for overly complex transformations. The authors therefore introduce two new modules that enhance the transformation modeling capability of CNNs: deformable convolution and deformable RoI pooling.
Main Idea
Deformable Convolution
Deformable convolution adds 2D offsets to the regular grid sampling locations of the standard convolution. The offsets are learned from the preceding feature maps via additional convolutional layers.


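A 2D convolution consists of two steps: 1) sampling the input feature map $x$ using a regular grid $R$; 2) summing the sampled values weighted by $w$. The grid $R$ defines the receptive field size and dilation. For each location $p_0$ on the output feature map $y$, with $p_n$ enumerating the locations in $R$:
$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$ [Eq-1]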

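In deformable convolution, the regular grid $R$ is augmented with offsets $\{\Delta p_n \mid n = 1, \dots, N\}$, where $N = |R|$, so the sampling happens at the irregular locations $p_n + \Delta p_n$:
$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$ [Eq-2]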

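As the offset $\Delta p_n$ is typically fractional, [Eq-2] is implemented via bilinear interpolation: the value at a fractional location $p$ is $x(p) = \sum_q G(q, p) \cdot x(q)$, where $q$ enumerates the integral locations in $x$ and $G$ is the bilinear interpolation kernel.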
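To make the interpolation step concrete, here is a minimal pure-Python sketch of sampling a feature map at a fractional location. It assumes the separable kernel $g(a, b) = \max(0, 1 - |a - b|)$ applied along each axis, and it treats locations outside the map as zero; the function name `bilinear_sample` and the zero-padding rule are my own choices for illustration, not something taken from the official code.

```python
import math

def bilinear_sample(x, py, px):
    """Sample the 2D feature map x (a list of rows) at fractional (py, px).

    Computes sum_q G(q, p) * x(q) with G(q, p) = g(q_y, p_y) * g(q_x, p_x)
    and g(a, b) = max(0, 1 - |a - b|).
    Assumption of this sketch: locations outside the map contribute zero.
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    total = 0.0
    # Only the four integer neighbours of p can have a non-zero kernel value.
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                total += g * x[qy][qx]
    return total

feature = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(feature, 0.5, 0.5)  # average of the four pixels
corner = bilinear_sample(feature, 1, 0)      # integer location: plain lookup
```

At an integer location the kernel is 1 for that pixel and 0 elsewhere, so the function falls back to an ordinary lookup.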

The offsets are obtained by applying a convolutional layer over the same input feature map. Its kernel has the same spatial resolution and dilation as the current convolutional layer. The output offset field has the same spatial resolution as the input feature map, and its channel dimension $2N$ corresponds to $N$ 2D offsets.
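The sketch below puts the pieces together for a single channel and a single output location: a 3 x 3 grid $R$ (so $N = 9$), one weight per grid point, and $2N = 18$ offset values for that location. In the paper the offsets come from an extra convolutional layer; here they are passed in by hand to keep the example small. The interleaved `[dy_1, dx_1, ..., dy_N, dx_N]` layout and the names `deform_conv_at` and `bilinear_sample` are assumptions for illustration only.

```python
import math

def bilinear_sample(x, py, px):
    """Bilinear interpolation of x at (py, px); out-of-range pixels count as 0."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    total = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                total += (max(0.0, 1 - abs(qy - py))
                          * max(0.0, 1 - abs(qx - px)) * x[qy][qx])
    return total

# Regular 3x3 grid R with dilation 1, as (dy, dx) pairs; N = |R| = 9.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deform_conv_at(x, weights, offsets, p0):
    """y(p0) = sum_n w(p_n) * x(p0 + p_n + dp_n) for one channel.

    weights: N values, one per grid point in R.
    offsets: 2N values laid out as [dy_1, dx_1, ..., dy_N, dx_N]
             (this layout is an assumption of the sketch).
    """
    y = 0.0
    for n, (dy, dx) in enumerate(R):
        off_y, off_x = offsets[2 * n], offsets[2 * n + 1]
        y += weights[n] * bilinear_sample(x, p0[0] + dy + off_y,
                                          p0[1] + dx + off_x)
    return y

# A 5x5 ramp x[r][c] = 5r + c and an averaging kernel.
x = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
w = [1.0 / 9] * 9
plain = deform_conv_at(x, w, [0.0] * 18, (2, 2))        # zero offsets
shifted = deform_conv_at(x, w, [0.0, 0.5] * 9, (2, 2))  # all samples 0.5 px right
```

With all offsets set to zero the function reduces to the standard convolution, which is a handy sanity check when experimenting with the idea.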
Deformable RoI Pooling
RoI pooling is used in all region-proposal-based object detection methods. It converts an input rectangular region of arbitrary size into fixed-size features.

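RoI pooling divides the RoI into $k \times k$ bins and outputs a $k \times k$ feature map $y$. Given the input feature map $x$ and an RoI of size $w \times h$ with top-left corner $p_0$, the $(i, j)$-th bin gives
$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p) / n_{ij}$ [Eq-4]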

where $n_{ij}$ is the number of pixels in the bin. The $(i, j)$-th bin spans $\left \lfloor{i\frac{w}{k}}\right \rfloor$ ≤ $p_x$ < $\left \lceil{(i+1)\frac{w}{k}}\right \rceil$ and $\left \lfloor{j\frac{h}{k}}\right \rfloor$ ≤ $p_y$ < $\left \lceil{(j+1)\frac{h}{k}}\right \rceil$.
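As a small illustration of the bin definition, the following sketch average-pools an RoI into $k \times k$ bins using exactly the floor/ceil spans given above. The function name `roi_avg_pool`, the `(x0, y0)` convention for the top-left corner, and the choice to index the result as `y[i][j]` (with $i$ along the width) are assumptions of this example.

```python
import math

def roi_avg_pool(x, p0, w, h, k):
    """Average-pool a w x h RoI with top-left corner p0 = (x0, y0) into k x k bins.

    Bin (i, j) spans floor(i*w/k) <= p_x < ceil((i+1)*w/k) and
    floor(j*h/k) <= p_y < ceil((j+1)*h/k); i runs along x, j along y.
    The result is indexed as y[i][j].
    """
    x0, y0 = p0
    y = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            xs = range(math.floor(i * w / k), math.ceil((i + 1) * w / k))
            ys = range(math.floor(j * h / k), math.ceil((j + 1) * h / k))
            n_ij = len(xs) * len(ys)  # number of pixels in the bin
            total = sum(x[y0 + py][x0 + px] for py in ys for px in xs)
            y[i][j] = total / n_ij
    return y

# 4x4 feature map with x[r][c] = 4r + c; the RoI covers the whole map.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = roi_avg_pool(fmap, (0, 0), 4, 4, 2)
```

When $w$ or $h$ is not divisible by $k$, the floor and ceil make neighbouring bins overlap by a pixel, which is why $n_{ij}$ is counted per bin instead of being fixed.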

In deformable RoI pooling, offsets $\{\Delta p_{ij} \mid 0 \le i, j < k\}$ are added to the spatial binning positions:
$y(i, j) = \sum_{p \in bin(i, j)} x(p_0 + p + \Delta p_{ij}) / n_{ij}$ [Eq-5]
As $\Delta p_{ij}$ is typically fractional, [Eq-5] is also implemented by bilinear interpolation.
[Figure-3] illustrates how the offsets are obtained. First, RoI pooling generates the pooled feature maps. From these maps, an fc layer generates the normalized offsets $\Delta \hat{p}_{ij}$, which are then transformed into the offsets $\Delta p_{ij}$ by an element-wise product with the RoI's width and height: $\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w, h)$. Here $\gamma$ is a pre-defined scalar that modulates the magnitude of the offsets.
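The scaling step is easy to check by hand. The sketch below applies $\Delta p_{ij} = \gamma \cdot \Delta \hat{p}_{ij} \circ (w, h)$ to a grid of normalized offsets. The helper name `scale_offsets` is made up, and the default $\gamma = 0.1$ is the value I recall the paper using, so treat it as an assumption here.

```python
def scale_offsets(normalized, w, h, gamma=0.1):
    """Turn normalized per-bin offsets into pixel offsets.

    normalized: k x k grid of (dx_hat, dy_hat) pairs from the fc layer.
    Returns a k x k grid of (dx, dy) = gamma * (dx_hat * w, dy_hat * h).
    gamma=0.1 is an assumed default, not confirmed by this post.
    """
    return [[(gamma * dx * w, gamma * dy * h) for (dx, dy) in row]
            for row in normalized]

# The same normalized offset applied to a small and a large RoI.
normalized = [[(0.5, -0.2)]]
small = scale_offsets(normalized, w=100, h=40)
large = scale_offsets(normalized, w=200, h=80)
```

Because the offset is multiplied by the RoI's width and height, the same normalized value moves a bin twice as far in an RoI that is twice as large, so the fc layer does not need to learn the RoI size itself.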
Experiments
The receptive field of deformable convolution is adaptive and can cover a larger region than the fixed receptive field of standard convolution. When deformable convolutions are stacked, the effect of the composited deformation is profound.



The table below shows the performance of deformable convolution and deformable RoI pooling.


We can also compare deformable convolution with atrous convolution. Put simply, deformable convolution can be seen as a generalization of atrous convolution.
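One way to see the "generalization" claim is that a fixed, hand-picked set of offsets reproduces atrous sampling exactly. The sketch below compares the sampling locations of a 3 x 3 atrous convolution with dilation $d$ against a regular 3 x 3 grid plus the offsets $\Delta p_n = (d - 1) \cdot p_n$. All function names are hypothetical and only sampling locations are computed, not the convolution itself.

```python
def regular_grid(kernel=3):
    """Grid R of (dy, dx) pairs for a kernel x kernel window with dilation 1."""
    r = kernel // 2
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def atrous_locations(p0, dilation, kernel=3):
    """Sampling locations p0 + dilation * p_n of an atrous convolution."""
    return [(p0[0] + dilation * dy, p0[1] + dilation * dx)
            for dy, dx in regular_grid(kernel)]

def deformable_locations(p0, offsets, kernel=3):
    """Sampling locations p0 + p_n + dp_n of a deformable convolution."""
    return [(p0[0] + dy + oy, p0[1] + dx + ox)
            for (dy, dx), (oy, ox) in zip(regular_grid(kernel), offsets)]

d = 2
# Offsets that push each grid point outward by (d - 1) * p_n.
offsets = [((d - 1) * dy, (d - 1) * dx) for dy, dx in regular_grid()]
same = atrous_locations((5, 5), d) == deformable_locations((5, 5), offsets)
```

The difference is that atrous convolution uses the same offsets at every position, while deformable convolution predicts them per location from the input, so they can also be fractional and irregular.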

Finally, we can compare the model complexity and runtime of deformable ConvNets and their plain counterparts.

Conclusion
The authors introduce deformable ConvNets, a simple, efficient, deep, and end-to-end solution for modeling dense spatial transformations. For the first time, they show that it is feasible and effective to learn dense spatial transformations in CNNs.
Reference
[Figure-1~6, Eq-1~5, Table-1~4]: https://arxiv.org/pdf/1703.06211