2024. 4. 15. 17:51
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Motivation
In vision, attention is either applied in conjunction with convolutional networks or used to replace certain components of convolutional networks while keeping their overall structure in place. The authors show that this reliance on CNNs is not necessary: a pure Transformer applied directly to sequences of image patches performs very well on image classification tasks. When pre-trained on large amounts of data, ViT attains excellent results compared to SOTA convolutional networks while requiring substantially fewer computational resources to train.
Main Idea
The authors split an image into patches and provide the sequence of linear embeddings of these patches as input to a Transformer, treating the patches like tokens in NLP.

Patch & Embedding
Naive application of self-attention to images would require each pixel to attend to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. The authors therefore split the image into fixed-size (16x16) patches. Because the Transformer uses a constant latent vector size D, they flatten the patches and map them to D dimensions with a trainable linear projection; the outputs are the patch embeddings. Flatten + linear projection can be replaced by an equivalent Conv2D(ic=3, oc=768, k=16, s=16).
$x \in \mathbb{R}^{224 \times 224 \times 3}$ → $x_p \in \mathbb{R}^{196 \times (16 \cdot 16 \cdot 3)}$ → $x_p E \in \mathbb{R}^{196 \times 768}$ (196 = 14 x 14 patches, D = 768)
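The shape bookkeeping above can be checked with a minimal NumPy sketch. It is only an illustration: the names `patchify`, `W_embed`, and `b_embed` are hypothetical, and the projection weights are random stand-ins for what would be trainable parameters in a real model.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))

D = 768
# Random stand-ins for the trainable linear projection (assumption).
W_embed = rng.standard_normal((16 * 16 * 3, D)) * 0.02
b_embed = np.zeros(D)

patches = patchify(image)              # (196, 768): 14 x 14 patches of 16*16*3 values
tokens = patches @ W_embed + b_embed   # (196, D) patch embeddings

# Number of attention pairs: every pixel vs. every patch as a token.
pixel_pairs = (224 * 224) ** 2
patch_pairs = patches.shape[0] ** 2
print(patches.shape, tokens.shape, pixel_pairs // patch_pairs)
```

With 16x16 patches the sequence is 256 times shorter than the pixel sequence, so the number of attention pairs drops by a factor of 256^2.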

Class Token
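The authors prepend a learnable [class] embedding to the sequence of patch embeddings; its state at the output of the Transformer encoder serves as the image representation. A classification head is attached to this representation: an MLP with one hidden layer during pre-training, and a single linear layer during fine-tuning.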
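A minimal sketch of how the class token changes the sequence, assuming N = 196 patch embeddings of size D = 768. The encoder is replaced by an identity placeholder, and `cls_token` and `W_head` are hypothetical names for parameters that would be learned in a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 196, 768, 1000

patch_tokens = rng.standard_normal((N, D))  # stand-in for the patch embeddings
cls_token = np.zeros((1, D))                # learnable in a real model (assumption: zero init)

# Prepend the class token: sequence length grows from N to N + 1.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)

# Placeholder: a real model would run the Transformer encoder here.
encoded = sequence

# Only the class-token position (index 0) feeds the classification head.
W_head = np.zeros((D, K))                   # single linear layer, as in fine-tuning
logits = encoded[0] @ W_head
print(sequence.shape, logits.shape)
```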

Positional Embedding
Position embeddings are added to the patch embeddings to retain positional information. The authors use standard learnable 1D position embeddings, added with broadcasting, since they did not observe significant performance gains from using more advanced 2D-aware position embeddings.
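As a rough illustration of the addition, the sketch below assumes a batch of B sequences that each contain the class token plus 196 patch tokens. A single table `pos_embed` (a hypothetical name, random here instead of learned) is broadcast over the batch dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, D = 4, 196, 768

tokens = rng.standard_normal((B, N + 1, D))            # class token + patch embeddings
pos_embed = rng.standard_normal((1, N + 1, D)) * 0.02  # one learnable row per position

# Broadcasting adds the same position table to every sequence in the batch.
z0 = tokens + pos_embed
print(z0.shape)
```

The table is indexed only by sequence position, so it carries no explicit row/column information about the patch grid.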

Transformer Encoder
The original Transformer encoder consists of alternating multi-headed self-attention (MSA) and MLP blocks. There, a residual connection is applied after every block and LayerNorm (LN) is applied after the residual connection. In ViT, by contrast, LN is applied before every block, and the residual connection is applied after every block.
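The difference in LN placement can be sketched as follows. This is a simplified illustration, not the paper's implementation: attention is single-head without an output projection, LayerNorm has no learnable scale or shift, and all weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (simplification of MSA).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mlp(x, W1, W2):
    h = x @ W1
    # GELU, tanh approximation.
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h ** 3)))
    return h @ W2

def pre_norm_block(x, p):
    # ViT style: LN before each block, residual added afterwards.
    x = x + self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + mlp(layer_norm(x), p["W1"], p["W2"])
    return x

def post_norm_block(x, p):
    # Original Transformer style: residual first, then LN.
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))
    x = layer_norm(x + mlp(x, p["W1"], p["W2"]))
    return x

rng = np.random.default_rng(0)
N, D, H = 10, 64, 256
params = {
    "Wq": rng.standard_normal((D, D)) * 0.1,
    "Wk": rng.standard_normal((D, D)) * 0.1,
    "Wv": rng.standard_normal((D, D)) * 0.1,
    "W1": rng.standard_normal((D, H)) * 0.1,
    "W2": rng.standard_normal((H, D)) * 0.1,
}
x = rng.standard_normal((N, D))
y_pre = pre_norm_block(x, params)
y_post = post_norm_block(x, params)
print(y_pre.shape, y_post.shape)
```

In the pre-norm variant the residual path carries the input through unnormalized, whereas the post-norm output is always normalized per token.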

Inductive bias
The authors note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only the MLP layers are local and translation equivariant, while the self-attention layers are global.
Hybrid architecture
In the hybrid model, the patch embedding projection is applied to patches extracted from a CNN feature map instead of raw image patches. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension.
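The 1x1 special case amounts to a reshape followed by a linear projection. The sketch below assumes a hypothetical CNN feature map of shape 14 x 14 x 1024; the actual shape depends on the backbone, and `W_proj` is a random stand-in for the trainable projection.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, D = 14, 14, 1024, 768   # hypothetical feature-map shape and Transformer width

feature_map = rng.standard_normal((H, W, C))   # stand-in for a CNN backbone output
W_proj = rng.standard_normal((C, D)) * 0.02    # trainable in a real model

# 1x1 "patches": every spatial location becomes one token.
flat = feature_map.reshape(H * W, C)   # (196, 1024)
hybrid_tokens = flat @ W_proj          # (196, 768)
print(flat.shape, hybrid_tokens.shape)
```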
Fine-Tuning and Higher Resolution
The authors pre-train ViT on large datasets and fine-tune it on (smaller) downstream tasks. For this, they remove the pre-trained prediction head and attach a zero-initialized D x K feedforward layer, where K is the number of downstream classes. It is often beneficial to fine-tune at a higher resolution than pre-training while keeping the patch size the same, which results in a longer sequence. The pre-trained position embeddings may then no longer be meaningful, so the authors perform 2D interpolation of the pre-trained position embeddings according to their location in the original image. This resolution adjustment and the patch extraction are the only points at which an inductive bias about the 2D structure of images is manually injected into ViT.
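A minimal sketch of resizing the position embeddings, under a few assumptions: the interpolation is bilinear with aligned corners (the summary above only says "2D interpolation"), the class-token embedding is kept unchanged, and the example goes from a 14 x 14 grid (224 / 16) to a 24 x 24 grid (384 / 16). `resize_pos_embed` is a hypothetical helper name.

```python
import numpy as np

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resize patch position embeddings; class token comes first.

    pos_embed: array of shape (1 + old_grid**2, D).
    """
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    d = patch_pos.shape[-1]
    grid = patch_pos.reshape(old_grid, old_grid, d)

    # Sample locations expressed in the old grid's coordinates.
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along rows first, then along columns.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)

rng = np.random.default_rng(0)
D, K = 768, 10
pos = rng.standard_normal((1 + 14 * 14, D)) * 0.02  # stand-in for pre-trained embeddings

resized = resize_pos_embed(pos, old_grid=14, new_grid=24)
W_head = np.zeros((D, K))   # zero-initialized D x K fine-tuning head
print(resized.shape, W_head.shape)
```

Because the embeddings are reshaped back into a 2D grid before interpolation, this step explicitly uses the spatial layout of the patches.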
Experiment
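Dataset. They use ImageNet-1k (1.3M images), ImageNet-21k (14M images), and JFT (303M images) for pre-training, and evaluate on ImageNet (with both the original and the cleaned-up ReaL labels), CIFAR-10/100, Oxford-IIIT Pets, Oxford Flowers-102, and the 19-task VTAB suite.
Comparison to SOTA
ViT achieves higher accuracy than the CNN-based models while using less pre-training compute, and its computation time does not increase much either.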

Pretraining data requirements
ViT outperforms the CNN baselines only when it is pre-trained on a sufficiently large dataset, and the larger the pre-training dataset, the more ViT benefits.

Scaling study
ViT shows less performance saturation than CNNs as the model and compute scale up, which motivates future scaling efforts.

Inspecting ViT
The learned 1D position embeddings already reflect the 2D layout of the image, which suggests that 2D-aware position embeddings are not needed.

Conclusion
By using a standard Transformer encoder, the authors avoid image-specific inductive biases beyond the initial patch extraction step. The Transformer needs pre-training on a large dataset, but it scales without saturating and achieves high accuracy on image classification.
Reference
[Figure-1, 5~8, Table-1]: https://arxiv.org/pdf/2010.11929.pdf
[Figure-5]: https://arxiv.org/pdf/1706.03762.pdf
[Figure-2~4]: Handmade