Paper review: RepVGG (CVPR 2021)

2023. 7. 24. 16:55

Making VGG-style ConvNets Great Again

Motivation

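Multi-branch models such as ResNet are slower at inference than plain single-branch models like VGG. The authors therefore propose RepVGG, a re-parameterized VGG: it is trained with a multi-branch topology, and after training the branches of each block (convolution plus batch normalization) are merged into a single 3x3 convolution. This shortens inference time and reduces the number of parameters in the inference-time model.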

Main Idea

Recent architectures are based on automatic or manual architecture search, or on a compound scaling strategy. Although many of these complicated ConvNets deliver higher accuracy, the drawbacks are significant. 1) Complicated multi-branch designs such as ResNet or DenseNet make the model difficult to implement and customize, slow down inference, and reduce memory utilization. 2) Some components, such as the depthwise conv in Xception and MobileNets and the channel shuffle in ShuffleNets, increase the memory access cost and lack support on various devices. The authors propose RepVGG to remove these drawbacks.

Architecture

[IMG-1] Architecture

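Multi-branch architectures such as ResNet are good for training but have the drawbacks above at inference. The authors propose to decouple the training-time multi-branch architecture from the inference-time plain architecture using structural re-parameterization. RepVGG has 5 stages and down-samples with a stride-2 convolution at the first layer of each stage. During training, each layer is a multi-branch block (a 3x3 conv branch and a 1x1 conv branch, each followed by batch normalization, plus an identity branch); at inference, each layer is only a 3x3 conv followed by ReLU. Because the first layer of each stage has stride 2, it has only the 3x3 conv branch and the 1x1 conv branch, without the identity branch.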
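As a rough illustration of the layout described above, the following sketch lists which branches a training-time layer has and how the feature-map size shrinks over the 5 stages. The input size of 224, the padding of 1, and the function names are assumptions made for this example, not taken from the paper's code.

```python
def train_branches(layer_index_in_stage):
    """Branches of a training-time layer; index 0 is the stride-2 layer of a stage."""
    if layer_index_in_stage == 0:
        # Stride 2 changes the spatial size, so an identity shortcut cannot be added.
        return ["conv3x3+bn", "conv1x1+bn"]
    return ["conv3x3+bn", "conv1x1+bn", "identity"]


def stage_output_sizes(input_size, num_stages=5):
    """Spatial size after each stage, assuming a 3x3 conv with stride 2 and padding 1."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size = (size + 2 * 1 - 3) // 2 + 1
        sizes.append(size)
    return sizes


sizes = stage_output_sizes(224)  # 224 is an assumed ImageNet-style input size
print(sizes)  # [112, 56, 28, 14, 7]
print(train_branches(0), train_branches(1))
```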

Re-parameterization

After training, we perform the transformation with simple algebra: an identity branch can be regarded as a degraded 1x1 conv, and the latter can be further regarded as a degraded 3x3 conv. So we can construct a single 3x3 kernel from the trained parameters of the 3x3 kernel, the 1x1 kernel, the identity, and the batch normalization layers. Consequently, the transformed model is a stack of 3x3 conv layers.

[IMG-2] Re-parameterization

Convolution layer

F = input ∗ $\omega$ + $\beta$.

This layer has weights $\omega$ with shape ($k_{h}$, $k_{w}$, $c_{in}$, $c_{out}$) and bias $\beta$ with shape ($c_{out}$). The bias is omitted when the convolution layer is followed by a batch normalization layer.

Batch normalization layer

F = (input - $\mu$) * $\gamma$ / $\sigma$ + $\beta$ = input * $\gamma$ / $\sigma$ + $\beta$ - $\mu$ * $\gamma$ / $\sigma$.

This layer has $\mu$ (moving_mean), $\sigma$ (the standard deviation, i.e. the square root of moving_variance plus a small epsilon), $\gamma$ (gamma), and $\beta$ (beta), each with shape ($c_{out}$).
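The rearrangement above says that inference-time batch normalization is just a per-channel scale and shift. A minimal NumPy sketch of that identity follows; the random values, the epsilon of 1e-5, and the function names are assumptions for illustration.

```python
import numpy as np


def bn_inference(x, mu, sigma, gamma, beta):
    """Batch normalization with fixed (moving) statistics, as in the formula above."""
    return (x - mu) * gamma / sigma + beta


def bn_as_affine(mu, sigma, gamma, beta):
    """Rewrite BN as x * scale + shift, one scale and one shift per channel."""
    scale = gamma / sigma
    shift = beta - mu * gamma / sigma
    return scale, shift


rng = np.random.default_rng(0)
c_out = 4
x = rng.normal(size=(5, 5, c_out))            # (H, W, c_out) feature map
mu = rng.normal(size=c_out)                   # moving_mean
moving_variance = rng.uniform(0.5, 2.0, size=c_out)
sigma = np.sqrt(moving_variance + 1e-5)       # assumed epsilon
gamma = rng.normal(size=c_out)
beta = rng.normal(size=c_out)

scale, shift = bn_as_affine(mu, sigma, gamma, beta)
same = np.allclose(bn_inference(x, mu, sigma, gamma, beta), x * scale + shift)
print(same)  # True
```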

Conv3x3 + BN

F = input ∗ $\omega_{3}$ * $\gamma$ / $\sigma$ + $\beta$ - $\mu$ * $\gamma$ / $\sigma$ = input ∗ $\omega_{3}^{'}$ + $\beta^{'}$
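To check the Conv3x3 + BN folding numerically, the sketch below scales each output channel of the kernel by $\gamma$ / $\sigma$ and moves the rest into a bias. The helper `conv2d_same` is a hypothetical stride-1, zero-padded convolution written only for this example, and all tensors are random.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1 convolution with zero 'same' padding.
    x: (H, W, c_in), w: (k_h, k_w, c_in, c_out) -> (H, W, c_out)."""
    k_h, k_w, _, c_out = w.shape
    H, W, _ = x.shape
    xp = np.pad(x, ((k_h // 2, k_h // 2), (k_w // 2, k_w // 2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for i in range(k_h):
        for j in range(k_w):
            # Each kernel tap is a (c_in, c_out) matrix applied to a shifted view.
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def fuse_conv_bn(w, mu, sigma, gamma, beta):
    """Fold BN into the preceding bias-free conv: returns (kernel', bias')."""
    scale = gamma / sigma                 # shape (c_out,), broadcasts over the last axis
    return w * scale, beta - mu * scale


rng = np.random.default_rng(0)
c_in, c_out = 3, 4
x = rng.normal(size=(6, 6, c_in))
w3 = rng.normal(size=(3, 3, c_in, c_out))
mu, gamma, beta = (rng.normal(size=c_out) for _ in range(3))
sigma = np.sqrt(rng.uniform(0.5, 2.0, size=c_out) + 1e-5)

w3_fused, b3_fused = fuse_conv_bn(w3, mu, sigma, gamma, beta)
y_conv_bn = (conv2d_same(x, w3) - mu) * gamma / sigma + beta   # conv, then BN
y_fused = conv2d_same(x, w3_fused) + b3_fused                  # single conv with bias
print(np.allclose(y_conv_bn, y_fused))  # True
```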

Conv1x1 + BN

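F = input ∗ $\omega_{1}$ * $\gamma$ / $\sigma$ + $\beta$ - $\mu$ * $\gamma$ / $\sigma$ = input ∗ $\omega_{1}^{'}$ + $\beta_{1}^{'}$

pad($\omega_{1}^{'}$) = $\omega_{3}^{''}$, $\beta^{''}$ = $\beta_{1}^{'}$

F = input ∗ $\omega_{3}^{''}$ + $\beta^{''}$

Here pad zero-pads the 1x1 kernel to 3x3, so that the original weight sits at the center of the 3x3 kernel.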
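The padding step can be sketched as follows: a 1x1 kernel is zero-padded to 3x3 so that its weight sits at the center tap, and both kernels then give the same output. `conv2d_same` is a hypothetical helper for this example (stride 1, zero padding of k // 2), which matches the assumption that the 1x1 branch uses no padding and the 3x3 branch uses padding 1.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1 convolution with zero 'same' padding.
    x: (H, W, c_in), w: (k_h, k_w, c_in, c_out) -> (H, W, c_out)."""
    k_h, k_w, _, c_out = w.shape
    H, W, _ = x.shape
    xp = np.pad(x, ((k_h // 2, k_h // 2), (k_w // 2, k_w // 2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for i in range(k_h):
        for j in range(k_w):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def pad_1x1_to_3x3(w1):
    """Zero-pad a (1, 1, c_in, c_out) kernel to (3, 3, c_in, c_out)."""
    return np.pad(w1, ((1, 1), (1, 1), (0, 0), (0, 0)))


rng = np.random.default_rng(0)
c_in, c_out = 3, 4
x = rng.normal(size=(6, 6, c_in))
w1 = rng.normal(size=(1, 1, c_in, c_out))   # stands in for the BN-fused 1x1 kernel

w1_as_3x3 = pad_1x1_to_3x3(w1)
y_1x1 = conv2d_same(x, w1)
y_3x3 = conv2d_same(x, w1_as_3x3)
print(w1_as_3x3.shape)             # (3, 3, 3, 4)
print(np.allclose(y_1x1, y_3x3))   # True
```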

Identity

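F = input ∗ pad(eye[1, 1, i, i]) + 0 = input ∗ $\omega_{3}^{'''}$ + $\beta^{'''}$

$\beta^{'''}$ = 0

Here eye[1, 1, i, i] is a 1x1 kernel whose ($c_{in}$, $c_{out}$) slice is the i x i identity matrix (an all-zero kernel would output 0, not the input), so the identity branch requires $c_{in}$ = $c_{out}$ = i. In the paper the identity branch also has a BN layer; it is folded in the same way as above, in which case $\beta^{'''}$ is that BN's shift rather than 0.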
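A small sketch of the identity branch written as a 3x3 kernel: the center tap holds an identity matrix over channels and every other tap is zero, so the convolution returns its input. The helper names are hypothetical, and the BN-free identity (bias 0) is the simplification used in this section.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1 convolution with zero 'same' padding.
    x: (H, W, c_in), w: (k_h, k_w, c_in, c_out) -> (H, W, c_out)."""
    k_h, k_w, _, c_out = w.shape
    H, W, _ = x.shape
    xp = np.pad(x, ((k_h // 2, k_h // 2), (k_w // 2, k_w // 2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for i in range(k_h):
        for j in range(k_w):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def identity_as_3x3(channels):
    """3x3 kernel that reproduces the input; needs c_in == c_out == channels."""
    w = np.zeros((3, 3, channels, channels))
    w[1, 1] = np.eye(channels)   # center tap maps channel k to channel k
    return w


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 4))
w_id = identity_as_3x3(4)
y_id = conv2d_same(x, w_id)
print(np.allclose(y_id, x))  # True
```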

Transformed Convolution layer

F = input ∗ ($\omega_{3}^{'}$ + $\omega_{3}^{''}$ + $\omega_{3}^{'''}$) + $\beta^{'}$ + $\beta^{''}$ + $\beta^{'''}$
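Putting the pieces together, the sketch below builds a training-time block (3x3 conv + BN, 1x1 conv + BN, identity) and checks that the single merged 3x3 conv gives the same output. It is a minimal NumPy check under assumptions: random tensors, stride 1, $c_{in}$ = $c_{out}$, an identity branch without BN as in the formulas above, and hypothetical helper names.

```python
import numpy as np


def conv2d_same(x, w):
    """Stride-1 convolution with zero 'same' padding.
    x: (H, W, c_in), w: (k_h, k_w, c_in, c_out) -> (H, W, c_out)."""
    k_h, k_w, _, c_out = w.shape
    H, W, _ = x.shape
    xp = np.pad(x, ((k_h // 2, k_h // 2), (k_w // 2, k_w // 2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for i in range(k_h):
        for j in range(k_w):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out


def bn(y, mu, sigma, gamma, beta):
    return (y - mu) * gamma / sigma + beta


def fuse_conv_bn(w, mu, sigma, gamma, beta):
    scale = gamma / sigma
    return w * scale, beta - mu * scale


def random_bn(rng, c, eps=1e-5):
    """Random (mu, sigma, gamma, beta) standing in for trained BN parameters."""
    sigma = np.sqrt(rng.uniform(0.5, 2.0, size=c) + eps)
    return rng.normal(size=c), sigma, rng.normal(size=c), rng.normal(size=c)


rng = np.random.default_rng(0)
c = 4
x = rng.normal(size=(6, 6, c))
w3 = rng.normal(size=(3, 3, c, c))
w1 = rng.normal(size=(1, 1, c, c))
bn3, bn1 = random_bn(rng, c), random_bn(rng, c)

# Training-time block: three branches added, then ReLU.
y_train = np.maximum(
    bn(conv2d_same(x, w3), *bn3) + bn(conv2d_same(x, w1), *bn1) + x, 0.0)

# Re-parameterization: fold BN, pad 1x1 to 3x3, write identity as 3x3, then sum.
w3_f, b3_f = fuse_conv_bn(w3, *bn3)
w1_f, b1_f = fuse_conv_bn(w1, *bn1)
w1_p = np.pad(w1_f, ((1, 1), (1, 1), (0, 0), (0, 0)))
w_id = np.zeros((3, 3, c, c))
w_id[1, 1] = np.eye(c)

w_merged = w3_f + w1_p + w_id
b_merged = b3_f + b1_f            # identity branch contributes a zero bias

# Inference-time block: one 3x3 conv with bias, then ReLU.
y_infer = np.maximum(conv2d_same(x, w_merged) + b_merged, 0.0)
print(np.allclose(y_train, y_infer))  # True
```

The merged block stores one 3x3 kernel and one bias instead of two kernels and two sets of BN statistics, which is where the smaller inference-time parameter count comes from.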

Effects of re-parameterization

There are at least three reasons for using simple ConvNets:

Fast

Multi-branch architectures have lower theoretical FLOPs than VGG but may not run faster.

[IMG-3] FLOPs on NVIDIA 1080Ti

NVIDIA devices are well optimized for 3x3 conv, but other or more recent devices can differ, so we need to refer to a table like this one for the target device.

Memory-economical

The multi-branch topology is memory inefficient because the results of every branch need to be kept until the addition or concatenation. In contrast a plain topology allows the memory occupied by the inputs to a specific layer to be immediately released when the operation is finished.

Flexible

The multi-branch topology imposes constraints on the architectural specification and limits the application of channel pruning. In contrast, a plain architecture allows us to freely configure every conv layer according to our requirements and to prune it to obtain a better performance-efficiency trade-off.

Conclusion

Re-parameterization can accelerate inference. So when designing a network, we need to consider its inference-time architecture together with its accuracy. A network that takes more time and has more parameters than another during training can end up faster and smaller in inference mode.

Reference

[IMG-1-3]: https://arxiv.org/pdf/2101.03697.pdf

[Simple Implementation]: https://github.com/kongbuhaja/RepVGG