Paper review: YOLO-World (CVPR 2024)


YOLO-World: Real-Time Open-Vocabulary Object Detection

Motivation

Traditional object detectors are trained on datasets with pre-defined categories, such as COCO and Objects365, and can only detect objects within that fixed set of categories. Open-vocabulary object detection was proposed to detect objects beyond the predefined categories, but recent open-vocabulary models are heavy and slow at inference. The authors therefore propose YOLO-World to improve both efficiency and open-vocabulary capability.

[Figure-1] Object Detector

Main Idea

Traditional object detection methods are trained with instance annotations $\Omega = \{B_{i}, c_{i}\}^{N}_{i=1}$, which consist of bounding boxes $\{B_{i}\}$ and category labels $\{c_{i}\}$. In this paper, the authors reformulate the instance annotations as region-text pairs $\Omega = \{B_{i}, t_{i}\}^{N}_{i=1}$, where $t_{i}$ is the text corresponding to the region $B_{i}$; $t_{i}$ can be a category name, a noun phrase, or an object description. Moreover, YOLO-World takes both an image $I$ and texts $T$ as input and outputs predicted boxes $\{\hat{B}_k\}$ and the corresponding object embeddings $\{e_k\}$, where $e_k \in \mathbb{R}^D$.
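To make the reformulation concrete, here is a minimal sketch of how a fixed-category annotation could be rewritten as a region-text pair; the dataclass and field names are illustrative, not taken from the official codebase.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RegionTextPair:
    """One reformulated annotation: a region B_i paired with free-form text t_i."""
    box: List[float]  # B_i as [x1, y1, x2, y2]
    text: str         # t_i: category name, noun phrase, or object description

# A fixed-category annotation {B_i, c_i} stores only an integer class id ...
fixed_annotation = ([48.0, 30.0, 210.0, 180.0], 17)  # c_i = 17, an index into a fixed label set
# ... whereas the region-text pair {B_i, t_i} carries the text itself.
pair = RegionTextPair(box=[48.0, 30.0, 210.0, 180.0], text="a brown dog")
```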

Model Architecture

[Figure-2] YOLO-World architecture

YOLO-World consists of a YOLO detector, a text encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Given the input text, the text encoder encodes it into text embeddings, and the image encoder in the YOLO detector extracts multi-scale features from the input image. YOLO-World then leverages the RepVL-PAN to enhance both the text and image representations by exploiting cross-modality fusion between the image features and the text embeddings.
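The overall data flow can be summarized in a short pseudo-forward pass. This is a hedged sketch in PyTorch style, where `text_encoder`, `image_encoder`, `repvl_pan`, and `head` are placeholder modules standing in for the components named above, not the official implementation.

```python
import torch

def yolo_world_forward(image, texts, text_encoder, image_encoder, repvl_pan, head):
    """Illustrative forward pass following the architecture description above."""
    W = text_encoder(texts)          # text embeddings W ∈ R^{C×D}
    feats = image_encoder(image)     # multi-scale image features from the YOLO backbone
    feats, W = repvl_pan(feats, W)   # cross-modality fusion (T-CSPLayer + I-Pooling Attention)
    boxes, obj_embeds = head(feats)  # predicted boxes and object embeddings
    scores = torch.einsum("bkd,bcd->bkc", obj_embeds, W)  # object-text similarity per prediction
    return boxes, scores
```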

Text Encoder

Given the text $T$, the authors adopt the CLIP text encoder to extract the corresponding text embeddings $W = \text{TextEncoder}(T) \in \mathbb{R}^{C \times D}$, where $C$ is the number of nouns and $D$ is the embedding dimension. When the input text is a caption or a referring expression, they use a simple n-gram algorithm to extract the noun phrases before feeding them into the text encoder. During training, they construct an online vocabulary $T$ for each image. At the inference stage, they present a prompt-then-detect strategy with an offline vocabulary for further efficiency: the user's prompts are encoded once with the text encoder to obtain offline vocabulary embeddings. The offline vocabulary avoids re-running the text encoder for every input and provides the flexibility to adjust the vocabulary as needed.
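As a concrete illustration of the prompt-then-detect idea, the snippet below encodes a user-defined vocabulary once with a CLIP text encoder and caches the embeddings. The Hugging Face checkpoint and the normalization step are assumptions for this sketch, not details taken from the paper.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Encode the prompts once, cache W, and reuse it for every image at inference time.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["person", "red backpack", "dog"]           # user-defined vocabulary
with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    W = text_model(**tokens).text_embeds              # offline vocabulary W ∈ R^{C×D}
    W = torch.nn.functional.normalize(W, dim=-1)      # L2-normalized text embeddings

# W can now be cached (or re-parameterized into the detector) so that no
# text-encoder forward pass is needed per image.
```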

 

Re-parameterizable Vision-Language PAN

[Figure-3] Re-parameterizable Vision-Language Path Aggregation Network

They propose the Text-guided CSP Layer (T-CSPLayer) and Image-Pooling Attention (I-Pooling Attention) to enhance the interaction between image features and text features, which improves the visual-semantic representation for open-vocabulary capability. During inference, the offline vocabulary embeddings can be re-parameterized into the weights of convolutional or linear layers for deployment.
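The sketch below shows one way the Image-Pooling Attention could look in PyTorch: each scale of image features is max-pooled to a 3x3 grid (27 patch tokens across three scales) and the text embeddings attend to these tokens with a residual update. The shared embedding dimension is a simplifying assumption (the real model projects each scale), and the T-CSPLayer on the image side is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IPoolingAttention(nn.Module):
    """Illustrative Image-Pooling Attention: text embeddings attend to pooled image patches."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_emb, multi_scale_feats):
        # text_emb: (B, C, D); multi_scale_feats: list of (B, D, H_l, W_l) maps
        patches = [F.adaptive_max_pool2d(x, 3).flatten(2).transpose(1, 2)
                   for x in multi_scale_feats]           # each scale -> (B, 9, D)
        patches = torch.cat(patches, dim=1)              # 3 scales -> (B, 27, D)
        updated, _ = self.attn(text_emb, patches, patches)
        return text_emb + updated                        # residual update of the text embeddings
```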

 

Text Contrastive Head

[Figure-4] Text Contrastive Head

They adopt a decoupled head with two 3x3 convs to regress bounding boxes $\{b_k\}^K_{k=1}$ and object embeddings $\{e_k\}^K_{k=1}$, where $K$ denotes the number of objects. They present a text contrastive head to obtain the object-text similarity $s_{k,j}$ by $s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(W_j)^{\top} + \beta$, where $W_j \in W$ is the $j$-th text embedding. The affine transformation with learnable scaling factor $\alpha$ and shifting factor $\beta$ is applied on top of the normalized dot product; both the L2 normalization and the affine transformation are important for stabilizing the region-text training.
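The similarity formula translates directly into a few lines of PyTorch; the shapes and random inputs below are only for illustration.

```python
import torch
import torch.nn.functional as F

def object_text_similarity(obj_embeds, text_embeds, alpha, beta):
    """s_{k,j} = alpha * L2Norm(e_k) · L2Norm(W_j)^T + beta."""
    e = F.normalize(obj_embeds, dim=-1)   # (K, D) object embeddings e_k
    w = F.normalize(text_embeds, dim=-1)  # (C, D) text embeddings W_j
    return alpha * e @ w.t() + beta       # (K, C) object-text similarities

# Example: K=3 predicted objects scored against a C=5 word vocabulary (D=512).
scores = object_text_similarity(torch.randn(3, 512), torch.randn(5, 512),
                                alpha=1.0, beta=0.0)
```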

 

Region-Text Contrastive Loss

Given an image $I$ and texts $T$, YOLO-World outputs $K$ object predictions $\{\hat{B}_k, s_k\}^K_{k=1}$ along with annotations $\Omega = \{B_i, t_i\}^N_{i=1}$. They construct the region-text contrastive loss $L_{con}$ from region-text pairs via cross entropy between the object-text (region-text) similarities and the object-text assignments. In addition, they adopt IoU loss and distributed focal loss for bounding box regression, and the total training loss is defined as $L(I) = L_{con} + \lambda_I \cdot (L_{iou} + L_{dfl})$, where $\lambda_I$ is an indicator factor set to 1 when the input image $I$ comes from detection or grounding data and to 0 when it comes from image-text data.
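A hedged sketch of how the total objective could be assembled is given below; the label assignment and the IoU/DFL box losses are assumed to be computed elsewhere (e.g. by a YOLOv8-style assigner and box head), so only the combination logic is shown.

```python
import torch.nn.functional as F

def training_loss(similarities, assigned_text_idx, loss_iou, loss_dfl,
                  from_detection_or_grounding: bool):
    """L(I) = L_con + λ_I * (L_iou + L_dfl), following the formula above."""
    l_con = F.cross_entropy(similarities, assigned_text_idx)  # region-text contrastive loss
    lambda_i = 1.0 if from_detection_or_grounding else 0.0    # image-text data: no box supervision
    return l_con + lambda_i * (loss_iou + loss_dfl)
```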

 

Pseudo Labeling with Image-Text Data

They propose an automatic labeling approach to generate region-text pairs from image-text pairs. The labeling approach contains three steps: (1) extract noun phrases: they first utilize the n-gram algorithm to extract noun phrases from the text; (2) pseudo labeling: they adopt a pre-trained open-vocabulary detector, e.g. GLIP, to generate pseudo boxes for the given noun phrases for each image; (3) filtering: they employ the pre-trained CLIP to evaluate the relevance of image-text and region-text pairs and filter out low-relevance pseudo annotations and images. They further remove redundant bounding boxes with methods such as NMS. The sketch after this paragraph outlines the procedure.
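The three steps can be written as a small pipeline function. The callable arguments below are hypothetical stand-ins for the n-gram extractor, the pre-trained open-vocabulary detector (e.g. GLIP), the CLIP relevance scorer, and the NMS step, so this is an outline of the procedure rather than a complete labeling tool.

```python
def pseudo_label(image, caption, extract_noun_phrases, detect, clip_relevance, nms,
                 threshold=0.3):
    """Outline of the three-step region-text labeling pipeline described above."""
    phrases = extract_noun_phrases(caption)            # (1) noun phrases via n-grams
    candidates = detect(image, phrases)                # (2) pseudo boxes from an OV detector
    pairs = [(box, text) for box, text in candidates   # (3) keep only CLIP-relevant pairs
             if clip_relevance(image, box, text) > threshold]
    return nms(pairs)                                  # drop redundant boxes
```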

Experiments

They demonstrate the effectiveness of the proposed YOLO-World by pre-training it on large-scale datasets and evaluating it in a zero-shot manner on both the LVIS and COCO benchmarks. They also evaluate the fine-tuning performance of YOLO-World on COCO and LVIS for object detection.

[Figure-5] examples of model output

Pre-training

For pre-training YOLO-World, they mainly adopt detection and grounding datasets, including Objects365, GQA, and Flickr30k. In addition, they extend the pre-training data with image-text pairs, i.e., CC3M.

[Table-1] Pre-training datasets, [Figure-6] Examples of pre-training datasets

Zero-shot Evaluation

After pre-training, they directly evaluate the proposed YOLO-World on the LVIS dataset in a zero-shot manner.

[Table-2] Zero-shot Evaluation on LVIS minival dataset

 

Fine-tuning YOLO-World on COCO 

They compare the pre-trained YOLO-World with previous YOLO detectors. For fine-tuning YOLO-World on the COCO dataset, they remove the RepVL-PAN for further acceleration considering that the vocabulary size of the COCO dataset is small. 

[Table-3] Comparison with YOLOs on COCO Object Detection

 

Fine-tuning YOLO-World on LVIS

They evaluate the fine-tuning performance of YOLO-World on the standard LVIS dataset. Compared to the oracle YOLOv8 detectors trained on the full LVIS dataset, YOLO-World achieves significant improvements, especially for larger models, which demonstrates the effectiveness of the proposed pre-training strategy for large-vocabulary detection; it also outperforms previous state-of-the-art two-stage methods.

[Table-4] results on LVIS object detection and instance segmentation

Conclusion

The authors present YOLO-World, a real-time open-vocabulary detector that improves both efficiency and capability. To this end, they adopt a prompt-then-detect strategy with an offline vocabulary at inference, use the RepVL-PAN to let text and image features interact, and build on the YOLO detector for feature extraction.

 

Reference

[Figure-1~6, Table-1~4]: https://arxiv.org/pdf/2401.17270