Motivation
Vision Transformers (ViTs) have superseded ConvNets as the state-of-the-art image classification models since the early 2020s. However, a vanilla ViT faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation, which is why hierarchical ViTs (e.g., Swin Transformers) reintroduce ConvNet priors. For these reasons, the authors gradually modernize a standard ResNet toward the design of a ViT and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is dubbed ConvNeXt.
Main Idea
Modernization roadmap
The colored bars are model accuracies in the ResNet-50/Swin-T FLOPs regime; results for the ResNet-200/Swin-B regime are shown with gray bars. A hatched bar means the modification was not adopted.
Macro Design
Swin Transformers follow ConvNets in using a multi-stage design, where each stage has a different feature map resolution. There are two interesting design considerations here: the stage compute ratio and the stem cell structure.
Stage compute ratio
The original design of the computation distribution across stages in ResNet was largely empirical. Swin Transformer follows the same principle but with a slightly different stage compute ratio of 1:1:3:1; for larger Swin Transformers, the ratio is 1:1:9:1. Following this design, the authors adjust the number of blocks in each stage from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3). This improves the model accuracy from 78.8% to 79.4%.
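To make the block-count change concrete, here is a minimal PyTorch sketch; `build_stages` and `block_fn` are hypothetical helpers for illustration, not the authors' code.

```python
import torch.nn as nn

# Minimal sketch: only the per-stage block counts change here.
# ResNet-50 uses depths (3, 4, 6, 3); the modernized model uses (3, 3, 9, 3),
# which roughly matches Swin-T's 1:1:3:1 stage compute ratio.
def build_stages(block_fn, widths, depths=(3, 3, 9, 3)):
    # `block_fn(width)` is a placeholder that builds one residual block at `width` channels.
    return nn.ModuleList(
        nn.Sequential(*[block_fn(w) for _ in range(d)])
        for w, d in zip(widths, depths)
    )
```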
Stem to "Patchify"
The stem cell design concerns how the input images are processed at the network's beginning. Because of the redundancy inherent in natural images, a common stem cell aggressively downsamples the input images to an appropriate feature map size in both standard ConvNets and ViTs. The stem cell in a standard ResNet contains a 7x7 convolution layer with stride 2, followed by a max pool, which results in a 4x downsampling of the input images. In ViTs, a more aggressive "patchify" strategy is used as the stem cell, which corresponds to a large kernel size (e.g., 14 or 16) and non-overlapping convolution. Swin Transformer uses a similar "patchify" layer, but with a smaller patch size of 4 to accommodate the architecture's multi-stage design. The authors replace the ResNet-style stem cell with a patchify layer implemented using a 4x4, stride-4 convolutional layer. The accuracy changes from 79.4% to 79.5%.
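A sketch contrasting the two stem designs described above; channel widths are illustrative, not taken from the paper's code.

```python
import torch.nn as nn

resnet_stem = nn.Sequential(          # 7x7 stride-2 conv + 3x3 stride-2 max pool -> 4x downsampling overall
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)  # non-overlapping 4x4 patches -> 4x downsampling
```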
ResNeXt-ify
The authors adopt the idea of ResNeXt, which has a better FLOPs/accuracy trade-off than a vanilla ResNet. The core component is grouped convolution, where the convolutional filters are separated into different groups. At a high level, ResNeXt's guiding principle is to "use more groups, expand width". More precisely, ResNeXt employs grouped convolution for the 3x3 conv layer in a bottleneck block. This reduces the FLOPs, so the width can be expanded to compensate for the capacity loss.
The authors use depthwise convolution, which is similar to the weighted sum operation in self-attention in that it operates on a per-channel basis. The combination of depthwise convs and 1x1 convs leads to a separation of spatial and channel mixing; depthwise convolution also effectively reduces the network FLOPs and, as expected, the accuracy. The authors then increase the network width to the same number of channels as Swin-T (from 64 to 96). This brings the network performance to 80.5% with increased FLOPs (5.3G).
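For reference, a small sketch of the grouped vs. depthwise distinction discussed above (widths and group count are illustrative assumptions):

```python
import torch.nn as nn

channels = 96
grouped_3x3   = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=32)        # ResNeXt-style grouped conv
depthwise_3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)  # depthwise: one filter per channel (spatial mixing only)
pointwise_1x1 = nn.Conv2d(channels, channels, kernel_size=1)                              # 1x1 conv mixes information across channels
```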
Inverted Bottleneck
One important design in every Transformer block is that it creates an inverted bottleneck: the hidden dimension of the MLP block is four times wider than the input dimension. This design is connected to the inverted bottleneck design with an expansion ratio of 4 used in ConvNets. Despite the increased FLOPs for the depthwise convolution layer, the change from (a) to (b) reduces the whole network's FLOPs from 5.3G to 4.6G, due to the significant FLOPs reduction in the downsampling residual blocks' shortcut 1x1 conv layer. This results in slightly improved performance (80.5% to 80.6%). In the ResNet-200/Swin-B regime, this step brings even more gain (81.9% to 82.6%), also with reduced FLOPs.
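A sketch of configuration (b) as described above, with an expansion ratio of 4; normalization and activation layers are omitted, and the widths follow the 96-channel stage for illustration only.

```python
import torch.nn as nn

dim = 96
inverted_bottleneck = nn.Sequential(
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                                 # 1x1 expand: 96 -> 384
    nn.Conv2d(4 * dim, 4 * dim, kernel_size=3, padding=1, groups=4 * dim),  # depthwise 3x3 at the wide width (hence its FLOPs rise)
    nn.Conv2d(4 * dim, dim, kernel_size=1),                                 # 1x1 project back: 384 -> 96
)
```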
Large Kernel Sizes
One of the most distinguishing aspects of ViTs is their non-local self-attention, which enables each layer to have a global receptive field. Although Swin Transformers reintroduced the local window to the self-attention block, the window size is at least 7x7, larger than the ResNe(X)t kernel size of 3x3.
Moving up depthwise conv layer
One prerequisite is to move up the position of the depthwise conv layer (IMG-2, (b) to (c)). That design is also evident in Transformers: the MSA block is placed prior to the MLP layers. With the inverted bottleneck in place, the complex/inefficient module (MSA, or here the large-kernel conv) operates on fewer channels, while the efficient, dense 1x1 layers do the heavy lifting. This intermediate step reduces the FLOPs to 4.1G, resulting in a temporary performance degradation to 79.9%.
Increasing the kernel size
The authors experimented with several kernel sizes, including 3, 5, 7, 9, and 11. The network's performance increases from 79.9% (3x3) to 80.6% (7x7), while the network's FLOPs stay roughly the same. Additionally, the authors observe that the benefit of larger kernel sizes reaches a saturation point at 7x7.
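Putting the two steps together, a sketch of configuration (c) with the enlarged kernel; normalization and activation layers are omitted here since they are revisited in the Micro Design steps below, and the widths are illustrative.

```python
import torch.nn as nn

dim = 96
block_c = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depthwise 7x7 moved to the front, on the narrow width
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # dense 1x1 expand does the heavy lifting
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 project back
)
```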
Micro Design
This part focuses on the specific choices of activation functions and normalization layers.
Replacing ReLU with GELU
ReLU is still extensively used in ConvNets due to its simplicity and efficiency, and it was also used in the original Transformer paper. GELU, which can be thought of as a smoother variant of ReLU, is utilized in the most advanced Transformers such as Google's BERT and OpenAI's GPT-2. The authors therefore replace ReLU with GELU in the ConvNet, although the accuracy stays unchanged (80.6%).
Fewer activation functions
One minor distinction between a Transformer and a ResNet block is that Transformers have fewer activation functions. A Transformer block consists of linear embedding layers for key/query/value, a projection layer, and two linear layers in the MLP block, with only one activation function present in the MLP block. So the authors remove all activation functions except for one between the two 1x1 layers. This improves the result by 0.7% to 81.3%, practically matching the performance of Swin-T.
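A before/after sketch of the activation count (normalization omitted; widths are illustrative assumptions):

```python
import torch.nn as nn

dim = 96
many_activations = nn.Sequential(    # ResNet-style habit: an activation after every conv layer
    nn.Conv2d(dim, dim, 7, padding=3, groups=dim), nn.ReLU(),
    nn.Conv2d(dim, 4 * dim, 1), nn.ReLU(),
    nn.Conv2d(4 * dim, dim, 1), nn.ReLU(),
)
single_activation = nn.Sequential(   # Transformer-style: one GELU, between the two 1x1 layers
    nn.Conv2d(dim, dim, 7, padding=3, groups=dim),
    nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
    nn.Conv2d(4 * dim, dim, 1),
)
```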
Fewer normalization layers
Transformer blocks usually have fewer normalization layers as well. The authors remove two BatchNorm (BN) layers, leaving only one BN layer before the 1x1 conv layers. This further boosts the performance to 81.4%.
Substituting BN with LN
BatchNorm is an essential component in ConvNets, as it improves convergence and reduces overfitting. On the other hand, the simpler Layer Normalization (LN) has been used in Transformers, resulting in good performance across different application scenarios. Usually, simply substituting LN for BN in the original ResNet results in suboptimal performance, but with the modernized network architecture and training techniques, the ConvNet model has no difficulty training with LN. The performance is slightly better, reaching an accuracy of 81.5%.
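The block after the Micro Design steps can be sketched as below: one LayerNorm (before the 1x1 layers) and one GELU. Since `nn.LayerNorm` normalizes the last dimension, the NCHW tensor is permuted to NHWC around it; this follows the structure described above but is not the authors' released implementation (layer scale and stochastic depth are left out).

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)           # the only normalization layer in the block
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv written as a Linear over the channel dimension
        self.act = nn.GELU()                    # the only activation in the block
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # NCHW -> NHWC for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # back to NCHW
        return shortcut + x                     # residual connection
```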
Separate downsampling layers
In ResNet, spatial downsampling is achieved by the residual block at the start of each stage, using a 3x3 conv with stride 2 (and a 1x1 conv with stride 2 at the shortcut connection). In Swin Transformers, a separate downsampling layer is added between stages. The authors explore a similar strategy, using 2x2 conv layers with stride 2 for spatial downsampling. Adding normalization layers wherever the spatial resolution is changed helps stabilize training; this is also done in Swin Transformers: one before each downsampling layer, one after the stem, and one after the final global average pooling. With this, the authors improve the accuracy to 82.0%.
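A sketch of such a separate downsampling layer: a normalization layer followed by a 2x2, stride-2 conv. Using a channel-wise LayerNorm here is an assumption consistent with the LN substitution above.

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)                                   # normalization before the spatial reduction
        self.reduce = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)  # 2x2 stride-2 conv halves the resolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)   # NCHW -> NHWC for LayerNorm
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)   # back to NCHW
        return self.reduce(x)
```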
Closing remarks
The ConvNeXt model has approximately the same FLOPs, parameters, throughput, and memory use as the Swin Transformer, but does not require specialized modules such as shifted window attention or relative position biases.
Empirical Evaluations on ImageNet
The authors construct different ConvNeXt variants, ConvNeXt-T/S/B/L, to be of similar complexities to Swin-T/S/B/L. In addition, they build a larger ConvNeXt-XL to further test the scalability of ConvNeXt. The variants only differ in the number of channels C and the number of blocks B in each stage; following both ResNets and Swin Transformers, the number of channels doubles at each new stage (the configurations are summarized again in the sketch after the list).
- ConvNeXt-T: C = (96, 192, 384, 768), B = (3, 3, 9, 3)
- ConvNeXt-S: C = (96, 192, 384, 768), B = (3, 3, 27, 3)
- ConvNeXt-B: C = (128, 256, 512, 1024), B = (3, 3, 27, 3)
- ConvNeXt-L: C = (192, 384, 768, 1536), B = (3, 3, 27, 3)
- ConvNeXt-XL: C = (256, 512, 1024, 2048), B = (3, 3, 27, 3)
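The same variant configurations written out as a plain Python dict, purely for reference (the short keys are informal names):

```python
convnext_variants = {
    "T":  {"channels": (96, 192, 384, 768),    "blocks": (3, 3, 9, 3)},
    "S":  {"channels": (96, 192, 384, 768),    "blocks": (3, 3, 27, 3)},
    "B":  {"channels": (128, 256, 512, 1024),  "blocks": (3, 3, 27, 3)},
    "L":  {"channels": (192, 384, 768, 1536),  "blocks": (3, 3, 27, 3)},
    "XL": {"channels": (256, 512, 1024, 2048), "blocks": (3, 3, 27, 3)},
}
```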
Results
Conclusion
The authors demonstrate that ConvNeXt, a pure ConvNet model, can perform as well as hierarchical ViTs on image classification, object detection, instance segmentation, and semantic segmentation tasks.
Reference
[IMG-1~5]: Liu et al., "A ConvNet for the 2020s" (CVPR 2022), https://arxiv.org/pdf/2201.03545.pdf