Paper review: Swin Transformer (ICCV 2021)

2024. 4. 19. 00:59

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Motivation

Adapting the Transformer from language to vision is challenging because of differences between the two domains: visual entities vary greatly in scale, and images have a much higher resolution of pixels than text has words. To address these differences, the authors propose a hierarchical Transformer whose representation is computed with shifted windows. This shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connections. It has linear computational complexity with respect to image size.

Main Idea

In this paper, the authors seek to expand the applicability of the Transformer so that it can serve as a general-purpose backbone for computer vision.

[Figure-1] Differences of Swin Transformer and ViT

Swin Transformer constructs a hierarchical representation by starting from small-sized patches and gradually merging neighboring patches in deeper layers. With these hierarchical feature maps, the model can conveniently leverage advanced techniques for dense prediction such as FPN or U-Net. Linear computational complexity is achieved by computing self-attention locally within non-overlapping windows. Because the number of patches in each window is fixed, the complexity is linear in image size.

A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers. The shifted windows bridge the windows of the preceding layer, providing connections among them. The proposed shifted window approach has much lower latency than the sliding window method, because all query patches within a window share the same key set, which facilitates memory access in hardware.

Overall Architecture

[Figure-2] architecture of a Swin Transformer

Swin Transformer first splits an input RGB image into non-overlapping patches. The authors use a patch size of 4x4, so the feature dimension of each patch is 4x4x3 = 48. A linear embedding layer is applied to this raw-valued feature to project it to an arbitrary dimension (C). Several Swin Transformer blocks are then applied to these patch tokens. The blocks maintain the number of tokens ($\frac{H}{4} \times \frac{W}{4}$) and, together with the linear embedding, form Stage 1.
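The patch split and linear embedding can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `patch_embed`, the random weights, and the choice C = 96 are assumptions made for the example.

```python
import numpy as np

def patch_embed(image, weight, bias, patch=4):
    """Split an (H, W, 3) image into non-overlapping patch x patch patches
    and linearly project each flattened patch to C dimensions."""
    H, W, ch = image.shape
    assert H % patch == 0 and W % patch == 0, "H and W must be multiples of patch"
    # (H/p, p, W/p, p, ch) -> (H/p, W/p, p, p, ch) -> (H/p * W/p, p*p*ch)
    x = image.reshape(H // patch, patch, W // patch, patch, ch)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * ch)
    return x @ weight + bias  # (num_tokens, C)

rng = np.random.default_rng(0)
C = 96  # illustrative embedding dimension (assumption)
image = rng.random((224, 224, 3))
weight = rng.standard_normal((4 * 4 * 3, C)) * 0.02  # 48 -> C projection
bias = np.zeros(C)

tokens = patch_embed(image, weight, bias)
print(tokens.shape)  # (3136, 96), i.e. (224/4 * 224/4, C)
```

Each row of `tokens` corresponds to one 4x4 patch, so a 224x224 input yields 56x56 tokens.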

To produce a hierarchical representation, the number of tokens is reduced by patch merging layers. The first patch merging layer concatenates the features of each group of 2x2 neighboring patches and applies a linear layer on the 4C-dimensional concatenated features. This reduces the number of tokens by a factor of 2x2 = 4 (2x downsampling of resolution), and the output dimension is set to 2C.
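A patch merging step can be sketched as below. This is a simplified illustration under stated assumptions: the function name `patch_merging` is hypothetical, the order in which the four neighbors are concatenated is an arbitrary choice, and any normalization around the linear layer is omitted.

```python
import numpy as np

def patch_merging(x, weight):
    """x: (H, W, C) token grid; weight: (4C, 2C). Returns (H/2, W/2, 2C)."""
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    # Gather the four tokens of each 2x2 neighborhood and concatenate channels.
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )  # (H/2, W/2, 4C)
    return merged @ weight  # linear layer: 4C -> 2C

rng = np.random.default_rng(0)
C = 96  # illustrative channel count (assumption)
x = rng.random((56, 56, C))
weight = rng.standard_normal((4 * C, 2 * C)) * 0.02

y = patch_merging(x, weight)
print(x.shape[0] * x.shape[1], "->", y.shape[0] * y.shape[1])  # 3136 -> 784
print(y.shape)  # (28, 28, 192)
```

Applying the same step after each later stage gives token grids of 56x56, 28x28, 14x14, and 7x7 for a 224x224 input.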

 

Shifted Window based Self-Attention

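The standard Transformer computes global self-attention, relating each token to all other tokens. This global computation leads to quadratic complexity with respect to the number of tokens, making it unsuitable for many vision problems that require a very large set of tokens for dense prediction or high-resolution images.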

 

Self-attention in non-overlapped windows. For efficient modeling, they propose to compute self-attention within local windows. The windows are arranged to evenly partition the image in a non-overlapping manner. Supposing each window contains $M \times M$ patches, the complexity of a global MSA module and a window-based one on an image of $h \times w$ patches is given by the equation below.

[Equation-1] Computational complexity of MSA and W-MSA

The former is quadratic in the number of patches $hw$, while the latter is linear when $M$ is fixed.
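To make the scaling concrete, the sketch below evaluates the two complexity expressions as given in the paper, $\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$ and $\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$. The function names and the sample sizes (C = 96, M = 7) are assumptions for illustration.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention: quadratic in the patch count h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention: linear in h*w for a fixed window size M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

C, M = 96, 7  # illustrative values (assumption)
for side in (56, 112, 224):
    ratio = msa_flops(side, side, C) / wmsa_flops(side, side, C, M)
    print(f"{side}x{side} patches: MSA / W-MSA = {ratio:.1f}")
```

Doubling the side length multiplies the W-MSA cost by exactly 4 (the growth in patch count), while the MSA cost grows faster because of the $(hw)^2$ term.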

 

Shifted window partitioning in successive blocks. The window-based self-attention module lacks connections across windows, which limits its modeling power. To introduce cross-window connections while maintaining non-overlapping windows, they propose a shifted window partitioning approach which alternates between two partitioning configurations in consecutive Swin Transformer blocks.

[Figure-4] Shifted window partitioning

The first module uses a regular window partitioning strategy that starts from the top-left pixel, so that the 8x8 feature map is evenly partitioned into 2x2 windows of size 4x4 (M=4). The next module adopts a windowing configuration that is shifted from that of the preceding layer, by displacing the windows by ($\lfloor\frac{M}{2}\rfloor$, $\lfloor\frac{M}{2}\rfloor$) pixels from the regularly partitioned windows.
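The regular partitioning of the 8x8 example can be sketched as a reshape. The helper name `window_partition` and the row-major window ordering are assumptions made for this illustration.

```python
import numpy as np

def window_partition(x, M):
    """x: (H, W, C) -> (num_windows, M*M, C), windows in row-major order."""
    H, W, C = x.shape
    assert H % M == 0 and W % M == 0, "H and W must be multiples of M"
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

M = 4
# Label every position of an 8x8 map with its flat index to see where it lands.
ids = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(ids, M)

print(regular.shape)  # (4, 16, 1): 2x2 windows of 4x4 positions each
print(regular[0, :, 0].reshape(M, M))
# [[ 0  1  2  3]
#  [ 8  9 10 11]
#  [16 17 18 19]
#  [24 25 26 27]]
```

In the shifted configuration the window grid is displaced by $\lfloor M/2 \rfloor = 2$ positions along both axes, so each shifted window straddles the boundaries of the regular windows shown here.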

 

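Efficient batch computation for shifted configuration. An issue with shifted window partitioning is that it results in more windows, from $\lceil\frac{h}{M}\rceil \times \lceil\frac{w}{M}\rceil$ to $(\lceil\frac{h}{M}\rceil + 1) \times (\lceil\frac{w}{M}\rceil + 1)$, and some of the windows will be smaller than $M \times M$.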

A naive solution is to pad the smaller windows to a size of $M \times M$ and mask out the padded values when computing attention. When the number of windows in regular partitioning is small, the increased computation with this naive solution is considerable.
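The overhead of the naive padding approach can be estimated by counting windows. The sketch below is a back-of-the-envelope calculation; the helper name `num_windows` is hypothetical.

```python
import math

def num_windows(h, w, M, shifted):
    """Number of windows tiling an h x w feature map with M x M windows.
    In the shifted layout the grid is displaced by floor(M/2), which adds
    one partial row and one partial column of windows."""
    rows, cols = math.ceil(h / M), math.ceil(w / M)
    return (rows + 1) * (cols + 1) if shifted else rows * cols

for h, w, M in [(8, 8, 4), (56, 56, 7)]:
    regular = num_windows(h, w, M, shifted=False)
    shifted = num_windows(h, w, M, shifted=True)
    # With padding, every shifted window costs as much as a full M x M window.
    print(f"{h}x{w}, M={M}: {regular} -> {shifted} windows, {shifted / regular:.2f}x")
```

For the 8x8 example with M=4 this gives 4 versus 9 windows, a 2.25x increase; the relative overhead shrinks as the number of regular windows grows.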

They therefore propose a more efficient batch computation approach based on cyclic-shifting toward the top-left direction. After this shift, a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window. With the cyclic shift, the number of batched windows remains the same as in regular window partitioning.
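The cyclic shift and the accompanying mask can be sketched with `np.roll`. This is one possible construction, not necessarily how the authors build the mask: positions are labeled by the shifted window they belong to, the labels are rolled together with the features, and attention is allowed only between positions with equal labels. All function names are assumptions, and the similarity scores are a toy stand-in for $QK^T$.

```python
import numpy as np

def window_partition(x, M):
    """x: (H, W, C) -> (num_windows, M*M, C), windows in row-major order."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def cyclic_shift(x, M):
    """Roll an (H, W, C) map toward the top-left by floor(M/2) on both axes."""
    s = M // 2
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

def shifted_attention_mask(H, W, M):
    """Boolean (num_windows, M*M, M*M) mask; True means attention is allowed."""
    s = M // 2
    band_h = (np.arange(H) - s) // M + 1  # shifted-window row index per position
    band_w = (np.arange(W) - s) // M + 1  # shifted-window column index
    labels = band_h[:, None] * (W // M + 1) + band_w[None, :]
    labels = cyclic_shift(labels[..., None], M)    # move labels with the features
    win = window_partition(labels, M)[..., 0]      # (num_windows, M*M)
    return win[:, :, None] == win[:, None, :]

H = W = 8
M, C = 4, 16  # illustrative sizes (assumption)
rng = np.random.default_rng(0)
x = rng.random((H, W, C))

windows = window_partition(cyclic_shift(x, M), M)  # still 4 batched windows
mask = shifted_attention_mask(H, W, M)

scores = windows @ windows.transpose(0, 2, 1)      # toy similarity scores
scores = np.where(mask, scores, -np.inf)           # block cross-sub-window pairs
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)

print(windows.shape)                 # (4, 16, 16)
print(mask.sum(axis=(1, 2)))         # [256 128 128  64]
```

The first batched window is a single intact window (all 256 pairs allowed), two contain 2 sub-windows each, and the last contains 4. After attention, rolling the output back by $+\lfloor M/2 \rfloor$ restores the original layout.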

 

Relative position bias. In computing self-attention, the authors follow prior work by including a relative position bias $B \in \mathbb{R}^{M^2 \times M^2}$ for each head when computing similarity.

[Equation-2] Operation of Self-Attention

Since the relative position along each axis lies in the range $[-M+1, M-1]$, they parameterize a smaller-sized bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$, and the values in $B$ are taken from $\hat{B}$. They observe significant improvements over counterparts without this bias term or that use absolute position embedding. The relative position bias learned in pre-training can also be used to initialize a model for fine-tuning with a different window size through bi-cubic interpolation.
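The lookup from $\hat{B}$ to $B$ and its use in attention, $\text{SoftMax}(QK^T/\sqrt{d} + B)V$, can be sketched as follows. This is a single-head illustration with random values; the function names and the flattened storage of $\hat{B}$ are assumptions.

```python
import numpy as np

def relative_position_index(M):
    """For each (query, key) pair in an M x M window, return the index into a
    flattened (2M-1)*(2M-1) bias table."""
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                 # (2, M*M) row and column of each token
    rel = coords[:, :, None] - coords[:, None, :]  # (2, M*M, M*M), values in [-M+1, M-1]
    rel = rel + (M - 1)                            # shift to [0, 2M-2]
    return rel[0] * (2 * M - 1) + rel[1]           # (M*M, M*M)

def window_attention(q, k, v, B):
    """Single-head attention within one window: softmax(q k^T / sqrt(d) + B) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + B
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

M, d = 7, 32
rng = np.random.default_rng(0)
bias_table = rng.standard_normal((2 * M - 1) ** 2)  # \hat{B}, flattened (random here)
index = relative_position_index(M)
B = bias_table[index]                               # (M^2, M^2) = (49, 49)

q, k, v = (rng.standard_normal((M * M, d)) for _ in range(3))
out = window_attention(q, k, v, B)
print(B.shape, out.shape)  # (49, 49) (49, 32)
```

Only $(2M-1)^2 = 169$ parameters per head are learned for M=7, yet every one of the $49 \times 49$ query-key pairs receives a bias that depends on their relative offset.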

 

Architecture Variants

They build the base model, called Swin-B, to have a model size and computation complexity similar to ViT-B/DeiT-B. The window size is set to M=7, the query dimension of each head is d=32, and the expansion layer of each MLP uses $\alpha$=4.
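The shared hyper-parameters can be collected in a small config sketch. The channel numbers and layer counts below are the ones reported in the paper for Swin-T/S/B/L; the class name `SwinConfig` and its helper methods are hypothetical and only illustrate how the per-stage widths and head counts follow from C and d.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SwinConfig:
    C: int               # channel number of the first stage
    depths: tuple        # number of Swin Transformer blocks per stage
    window: int = 7      # M
    head_dim: int = 32   # d
    mlp_ratio: int = 4   # alpha

    def stage_dims(self):
        # Each patch merging layer doubles the channel dimension.
        return [self.C * 2**i for i in range(len(self.depths))]

    def stage_heads(self):
        # With a fixed per-head dimension, head count scales with width.
        return [dim // self.head_dim for dim in self.stage_dims()]

VARIANTS = {
    "Swin-T": SwinConfig(C=96, depths=(2, 2, 6, 2)),
    "Swin-S": SwinConfig(C=96, depths=(2, 2, 18, 2)),
    "Swin-B": SwinConfig(C=128, depths=(2, 2, 18, 2)),
    "Swin-L": SwinConfig(C=192, depths=(2, 2, 18, 2)),
}

for name, cfg in VARIANTS.items():
    print(name, cfg.stage_dims(), cfg.stage_heads())
```

For Swin-B this yields stage widths of 128, 256, 512, 1024 and 4, 8, 16, 32 heads.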

 

Experiments

[Table-1] Detailed architecture

They conduct experiments on ImageNet-1K image classification, COCO object detection, and ADE20K semantic segmentation.

Image Classification

They consider two training settings: regular training on ImageNet-1K, and pre-training on ImageNet-22K followed by fine-tuning on ImageNet-1K. They include most of the augmentation and regularization strategies of DeiT in training, except for repeated augmentation and EMA, which do not enhance performance. This is contrary to DeiT, where repeated augmentation is crucial to stabilize the training of ViT.

[Table-2] Experiments about ImageNet-1K image classification

 

Object Detection on COCO

They utilize the same settings across the compared detection frameworks: multi-scale training (resizing the input such that the shorter side is between 480 and 800 while the longer side is at most 1333). For system-level comparison, they adopt an improved HTC with InstaBoost, stronger multi-scale training, Soft-NMS, and an ImageNet-22K pre-trained model as initialization.
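The multi-scale resize rule can be illustrated with a small helper. This is a sketch of one common way to apply such a rule, assuming the aspect ratio is preserved and the shorter-side target is sampled uniformly; the function name `resize_shape` and the rounding behavior are assumptions, not details from the paper.

```python
import random

def resize_shape(h, w, short_side, max_long=1333):
    """Scale (h, w) so the shorter side equals short_side, unless that would
    push the longer side past max_long; then cap the longer side instead."""
    scale = short_side / min(h, w)
    if max(h, w) * scale > max_long:
        scale = max_long / max(h, w)
    return round(h * scale), round(w * scale)

print(resize_shape(480, 640, short_side=800))   # (800, 1067)
print(resize_shape(400, 1000, short_side=800))  # (533, 1333), longer side capped

# During multi-scale training a new target is drawn per image.
random.seed(0)
target = random.randint(480, 800)
print(target, resize_shape(480, 640, short_side=target))
```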

[Table-3] Experiments about COCO object detection

 

Semantic Segmentation on ADE20K

[Table-4] Experiments about ADE20K

 

Shifted windows

The latency overhead introduced by the shifted windows is also small.

[Table-5] Experiments about shifted windows

 

Relative position bias

They find that an inductive bias that encourages certain translation invariance is still preferable for general-purpose visual modeling, particularly for the dense prediction tasks of object detection and semantic segmentation.

[Table-6] Experiments about relative position

 

Swin MLP-Mixer

[Table-7] Experiments about Swin MLP-Mixer

 

Conclusion

Swin Transformer produces a hierarchical feature representation and has linear computational complexity with respect to input image size. It achieves state-of-the-art performance on COCO object detection and ADE20K semantic segmentation.

As a key element of Swin Transformer, the shifted window based self-attention is shown to be effective and efficient on vision problems.

 

Reference

[Figure-1~4, Table-1~7, Equation-1~2]: https://arxiv.org/pdf/2103.14030.pdf