Paper review: YOLOv7 (CVPR 2023)


YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Motivation

In recent years, real-time object detectors have focused on designing architectures for different edge devices or GPUs. The authors instead focus on optimized modules and optimization methods that may increase the training cost to improve object detection accuracy, but do not increase the inference cost. They call these methods trainable bag-of-freebies. Model re-parameterization and dynamic label assignment have become important topics in network training and object detection, and the authors identify new issues in both. They therefore propose a modified model re-parameterization and a coarse-to-fine lead guided label assignment.

 

Main Idea

The authors want to design new trainable bag-of-freebies methods for the issues derived from SOTA methods, combined with a more robust loss function, a more efficient label assignment method, and a more efficient training method.

Architecture

[Figure-1] Architectures of modules

Extended efficient layer aggregation networks

ELAN is designed by controlling the shortest and longest gradient paths, so that a deeper network can learn and converge effectively.
In this paper the authors propose Extended-ELAN (E-ELAN). Regardless of the gradient path length and the stacking number of computational blocks, large-scale ELAN has reached a stable state; if computational blocks are stacked unlimitedly, this stable state may be destroyed. E-ELAN uses expand, shuffle, and merge cardinality to continuously enhance the learning ability of the network. Group convolution is used to expand the channel and cardinality of the computational blocks; the feature map calculated by each computational block is then shuffled into $g$ groups and the groups are concatenated together. The number of channels in each group of feature maps is the same as in the original architecture. Finally, the $g$ groups of feature maps are added to perform merge cardinality. As a result, E-ELAN also guides different groups of computational blocks to learn more diverse features.
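As a rough PyTorch sketch of expand, shuffle, and merge cardinality (a hypothetical module, not the official YOLOv7 code; it assumes `channels` is divisible by `g`):

```python
import torch
import torch.nn as nn

class EELANMerge(nn.Module):
    """Minimal sketch of E-ELAN's expand-shuffle-merge cardinality."""

    def __init__(self, channels: int, g: int = 2):
        super().__init__()
        self.g = g
        # Expand: group convolution multiplies the channel count by g
        # without mixing information across groups.
        self.expand = nn.Conv2d(channels, channels * g, kernel_size=3,
                                padding=1, groups=g, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.expand(x)                     # (B, C*g, H, W)
        b, cg, h, w = y.shape
        # Shuffle: regroup the expanded channels into g groups, each with
        # the same channel count as the original architecture.
        y = y.view(b, self.g, cg // self.g, h, w)
        # Merge cardinality: add the g groups so different groups of
        # computational blocks are pushed to learn diverse features.
        return y.sum(dim=1)                    # (B, C, H, W)
```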

 

[Figure-2] Model scaling for concatenation-based models

Model scaling for concatenation-based models

The main purpose of model scaling is to adjust some attributes of the model and generate models at different scales to meet the needs of different inference speeds. When a PlainNet or ResNet is scaled up or down, the in-degree and out-degree of each layer do not change, but when a concatenation-based architecture is scaled, they do. The authors therefore propose a corresponding compound model scaling method for concatenation-based models: when scaling the depth factor of a computational block, the resulting change of the block's output channels must also be calculated, and width factor scaling is then performed on the transition layers with the same amount of change. As a result, the scaled model keeps the properties it had at the initial design and maintains the optimal structure.
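A toy sketch of the idea under a simplifying assumption (all concatenated channels come from the stacked branches; the helper and its parameter names are hypothetical):

```python
def compound_scale(depth: int, branch_channels: int,
                   transition_width: int, depth_factor: float = 1.5):
    """Scaling depth changes a concatenation-based block's output width,
    so the transition layer is widened by the same induced ratio."""
    new_depth = round(depth * depth_factor)
    old_out = depth * branch_channels   # channels after concatenation
    new_out = new_depth * branch_channels
    width_factor = new_out / old_out    # width change induced by depth scaling
    return new_depth, new_out, round(transition_width * width_factor)

# Depth 2 -> 3 widens the concatenated output by 1.5x, so the following
# transition layer must be scaled by the same factor to keep proportions.
print(compound_scale(depth=2, branch_channels=64, transition_width=128))
```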

Trainable bag-of-freebies

Planned re-parameterized convolution

[Figure-3] Planned re-parameterized convolution

Although RepConv has achieved excellent performance on VGG, its accuracy is significantly reduced when it is applied directly to other architectures such as ResNet and DenseNet. The authors designed planned re-parameterized convolution by analyzing the gradient flow propagation paths. RepConv combines a 3x3 convolution, a 1x1 convolution, and an identity connection in one convolutional layer, but the authors find that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet. So they use RepConv without the identity connection (RepConvN).
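A minimal sketch of the re-parameterization step (batch-norm fusion omitted; not the official implementation): at inference, the 3x3 and 1x1 branches of RepConvN collapse into a single 3x3 convolution, and the identity branch is simply absent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def fuse_repconvn(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Collapse the 3x3 and 1x1 branches into one 3x3 conv for inference."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels,
                      kernel_size=3, padding=1, bias=True)
    # Zero-pad the 1x1 kernel to 3x3 so the two branches can be summed.
    fused.weight.copy_(conv3x3.weight + F.pad(conv1x1.weight, [1, 1, 1, 1]))
    b3 = conv3x3.bias if conv3x3.bias is not None else torch.zeros(conv3x3.out_channels)
    b1 = conv1x1.bias if conv1x1.bias is not None else torch.zeros(conv1x1.out_channels)
    fused.bias.copy_(b3 + b1)
    # No identity branch: this is what distinguishes RepConvN from RepConv,
    # so the residual / concatenation paths of the host network stay intact.
    return fused
```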

 

Coarse for auxiliary and fine for lead loss

Deep supervision adds an extra auxiliary head in the middle layers of the network and guides the shallow network weights with an assistant loss; it is effective on many tasks. The authors call the head responsible for the final output the lead head, and the other heads auxiliary heads. The question then becomes: "How should soft labels be assigned to the auxiliary head and the lead head?"
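As a minimal sketch of deep supervision (not the authors' implementation), a weighted auxiliary loss is simply added to the lead loss; `detection_loss` below is a placeholder and the 0.25 weight is a hypothetical choice, not a value from the paper:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred, target):
    # Placeholder for the usual box / objectness / class detection losses.
    return F.mse_loss(pred, target)

def deep_supervision_loss(lead_pred, aux_pred, target, aux_weight=0.25):
    # The assistant (auxiliary) loss guides the shallow network weights;
    # aux_weight is a hypothetical value, not taken from the paper.
    return (detection_loss(lead_pred, target)
            + aux_weight * detection_loss(aux_pred, target))
```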

[Figure-4] Architectures of the auxiliary head and assigner

The most popular method at present is (c). The authors instead propose (d) and (e), a new label assignment method that guides both the auxiliary head and the lead head with the lead head prediction.

 

The lead head guided label assigner is mainly calculated based on the prediction result of the lead head and the ground truth, and generates soft labels through an optimization process. This set of soft labels is used as the training target for both the auxiliary head and the lead head.
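As a heavily simplified sketch of the soft-label idea (the paper's assigner additionally involves an optimization step; all names below are hypothetical), the lead head's prediction quality with respect to the ground truth can serve as the soft objectness target for both heads:

```python
import numpy as np

def box_iou(preds: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """IoU between predicted boxes (N, 4) and one GT box, (x1, y1, x2, y2)."""
    ix1 = np.maximum(preds[:, 0], gt[0]); iy1 = np.maximum(preds[:, 1], gt[1])
    ix2 = np.minimum(preds[:, 2], gt[2]); iy2 = np.minimum(preds[:, 3], gt[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_p = (preds[:, 2] - preds[:, 0]) * (preds[:, 3] - preds[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_p + area_g - inter + 1e-9)

# The lead head's IoU with the GT becomes the soft target that trains
# BOTH the lead head and the auxiliary head.
lead_pred_boxes = np.array([[0.0, 0.0, 4.0, 4.0], [1.0, 1.0, 5.0, 5.0]])
gt_box = np.array([1.0, 1.0, 4.0, 4.0])
soft_objectness = box_iou(lead_pred_boxes, gt_box)
```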

 

The coarse-to-fine lead head guided label assigner also uses the predicted result of the lead head and the ground truth to generate soft labels, but in this process two different sets of soft labels are generated. The fine labels are the same as the soft labels generated by the lead head guided label assigner, while the coarse labels are generated by relaxing the constraints of the positive sample assignment process so that more grids are treated as positive targets. The reason is that the auxiliary head is not as powerful as the lead head, and this prevents the model from losing the information needed for object detection. However, if the additional weight of the coarse labels is close to that of the fine labels, it may produce a bad prior at the final prediction. Therefore, to make the extra coarse positive grids have less impact, the authors make the optimizable upper bound of the fine labels always higher than that of the coarse labels.
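A toy 1-D sketch of the coarse-to-fine idea (the radii below are hypothetical, not the paper's actual assignment rule): the fine set keeps the strict positive-sample constraint used for the lead head, while the coarse set relaxes it so more grids become positives for the auxiliary head.

```python
import numpy as np

def assign_positive_grids(gt_cx: float, grid_cx: np.ndarray,
                          fine_r: float = 0.5, coarse_r: float = 1.5):
    # Distance of every grid center to the GT center.
    d = np.abs(grid_cx - gt_cx)
    fine = d < fine_r      # strict constraint -> fine labels for the lead head
    coarse = d < coarse_r  # relaxed constraint -> coarse labels for the aux head
    return fine, coarse

# The coarse positives are a superset of the fine ones: the extra grids
# supervise only the weaker auxiliary head.
fine, coarse = assign_positive_grids(3.2, np.arange(8) + 0.5)
```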

 

Other trainable bag-of-freebies

They use three further trainable bag-of-freebies, training tricks taken from other papers. (1) Conv-bn-activation topology: the purpose is to integrate the mean and variance of batch normalization into the bias and weight of the convolutional layer. (2) Implicit knowledge in YOLOR combined with the convolutional feature map by addition and multiplication: implicit knowledge can be simplified to a vector by pre-computing it at the inference stage. (3) EMA model: EMA is a technique used in mean teacher; in their system the EMA model is used purely as the final inference model.
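For (1), the standard conv-bn fusion trick folds the batch-norm statistics into the convolution at inference time; a minimal sketch (inference mode, activation omitted):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN mean/variance (and affine params) into the conv weight and bias."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation,
                      conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.view(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused
```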

 

Experiments

They choose Scaled-YOLOv4 and YOLOR as baselines. The results show that the proposed model has fewer parameters, less computation, and higher AP, and is also faster than the others.

[Table-1] Comparison of SOTA real-time object detectors

Proposed compound scaling method

The proposed compound scaling method scales up the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times. The results show that the proposed compound scaling strategy utilizes parameters and computation more efficiently.

[Table-2] Experiment proposed model scaling

Proposed planned re-parameterized model

To verify the generality of the proposed planned re-parameterized model, they apply it to a concatenation-based model and a residual-based model respectively. In both cases they replace the 3x3 convolutional layers at different positions (e.g., in a 3-stacked ELAN for the concatenation-based model) with RepConv.

[Table-3] Experiment of concatenation based model
[Table-4] Experiment of residual based model

Proposed assistant loss for auxiliary head

[Figure-5] Objectness map of label assignments

They compare four label assignment methods; the results show that coarse-to-fine lead guided assignment is the best.
They then compare the auxiliary head with and without the constraint, i.e., whether the auxiliary head is allowed to treat samples that are not positives of the lead head as positive samples.

[Table-5] Experiment on label assignment

Since the proposed YOLOv7 uses multiple pyramids to jointly predict object detection results, the auxiliary head can be directly connected to the pyramid in the middle layers for training. Because the auxiliary head needs more information, they connect it after one of the sets of feature maps before merging cardinality.

[Table-6] Experiment on partial auxiliary head

Conclusions

In this paper they propose a new architecture for real-time object detectors and the corresponding model scaling method. For this purpose they introduce compound scaling, a planned re-parameterized module, and dynamic label assignment with an auxiliary head, and the resulting models achieve state-of-the-art results.

 

Reference

[Figure-1~5, Table-1~6]: https://github.com/WongKinYiu/yolov7/blob/main/paper/yolov7.pdf
