Paper review: YOLOv9 (arxiv 2024)

2024. 4. 16. 13:45 | Review / 2D Object Detection

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Motivation

An appropriate architecture has to be designed that can facilitate the acquisition of enough information for prediction. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information is lost. The authors propose the concept of Programmable Gradient Information (PGI), which generates reliable gradient information to update network weights. In addition, they design the Generalized Efficient Layer Aggregation Network (GELAN) based on gradient path planning.

Main Idea

Programmable Gradient Information(PGI)

[Figure-1] PGI and related network architectures

The authors propose a new auxiliary supervision framework called Programmable Gradient Information (PGI). PGI includes three components: (1) a main branch, (2) an auxiliary reversible branch, and (3) multi-level auxiliary information. At inference time PGI uses only the main branch, so it does not require any additional inference cost.

 

Auxiliary Reversible Branch. The authors propose an auxiliary reversible branch that generates reliable gradients for updating network parameters by using a reversible architecture; folding the main branch itself into a reversible architecture would cost too much at inference. The proposed method can also be applied to shallower networks, because it generates useful gradients through the auxiliary supervision mechanism. Finally, the auxiliary reversible branch can be removed at the inference step without any performance drop.
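The train-time/inference-time asymmetry above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the authors' code; all names (`training_loss`, `aux_weight`, etc.) are hypothetical:

```python
# Minimal sketch of auxiliary supervision: the auxiliary branch contributes
# an extra loss term during training and is simply dropped at inference.
# All names here are illustrative, not from the YOLOv9 codebase.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def training_loss(main_pred, aux_pred, target, aux_weight=0.25):
    # The auxiliary reversible branch supplies reliable gradients to the
    # main branch; its loss is weighted and added to the main loss.
    return mse(main_pred, target) + aux_weight * mse(aux_pred, target)

def inference(main_branch, x):
    # At inference the auxiliary branch is removed entirely, so the
    # deployed model pays no extra cost.
    return main_branch(x)
```

Because only `training_loss` ever touches the auxiliary predictions, deleting the auxiliary branch after training changes nothing about the deployed forward pass.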

 

Multi-level Auxiliary Information. For object detection, different feature pyramid levels can be used to perform different tasks; for example, together they can detect objects of different sizes. If a deep supervision branch is connected directly, the shallow features will be guided to learn features for small objects only. The concept of multi-level auxiliary information is to insert an integration network between the feature pyramids of the auxiliary supervision and the main branch, and use it to combine the gradients from the different prediction heads. Multi-level auxiliary information then aggregates the gradient information containing all target objects and passes it to the main branch, which updates its parameters for objects of every size. As a result, this method alleviates the broken-information problem of deep supervision.
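The key operation — combining gradients from the different prediction heads so that every level sees information about all object sizes — can be sketched as a simple element-wise sum. This is a hypothetical illustration of the aggregation idea, not the integration network from the paper:

```python
# Illustrative sketch (not the paper's code): gradients arriving from each
# prediction head (small / medium / large objects) are combined before being
# passed to the main branch, so the resulting update carries information
# about all target objects rather than one size range per level.

def combine_head_gradients(grads_per_head):
    # grads_per_head: list of gradient vectors, one per prediction head.
    n = len(grads_per_head[0])
    return [sum(g[i] for g in grads_per_head) for i in range(n)]
```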

Generalized ELAN

[Figure-2] Architectures of CSPNet, ELAN, GELAN

The authors propose a new network architecture, GELAN, by combining two neural network architectures designed with gradient path planning: CSPNet and ELAN. The design balances light weight, inference speed, and accuracy.
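A back-of-the-envelope way to see the split-and-concatenate pattern GELAN inherits from CSPNet and ELAN is to track channel counts through one block. The numbers and function below are purely illustrative, not taken from the YOLOv9 configs:

```python
# Channel-count sketch of a GELAN-style block (illustrative only):
# CSPNet contributes the channel split, ELAN contributes the chain of
# computational blocks whose outputs are all concatenated at the end.

def gelan_out_channels(c_in, n_blocks, c_block):
    c_half = c_in // 2                 # CSP-style split into two halves
    kept = [c_half, c_half]            # both halves reach the final concat
    for _ in range(n_blocks):          # ELAN-style chain: each block's
        kept.append(c_block)           # output is also concatenated
    return sum(kept)                   # a 1x1 conv would then fuse these

print(gelan_out_channels(64, 2, 32))   # 32 + 32 + 32 + 32 = 128
```

Because every intermediate output survives to the concatenation, gradients have multiple short paths back to the input — the "gradient path planning" idea both parent architectures share.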

 

Experiments

All YOLO models are trained from scratch for 500 epochs. YOLOv9 is based on YOLOv7 and Dynamic YOLOv7. The authors use GELAN with RepConv instead of ELAN, along with a simplified downsampling module and an optimized anchor-free prediction head.
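Since the setup mentions GELAN with RepConv, here is a small sketch of the re-parameterization idea behind RepConv: at training time a 3x3 conv and a parallel 1x1 conv are separate branches, and at inference the 1x1 kernel is folded into the 3x3 kernel so only one conv remains. The single-channel, bias-free version below is a simplified illustration, not the actual implementation:

```python
# Sketch of RepConv-style re-parameterization (illustrative, single channel):
# zero-pad the 1x1 kernel to 3x3 and add it element-wise to the 3x3 kernel,
# leaving one equivalent 3x3 conv with no extra runtime branches.

def fuse_repconv(k3x3, k1x1):
    fused = [row[:] for row in k3x3]   # copy the 3x3 kernel
    fused[1][1] += k1x1                # the 1x1 kernel lands at the center
    return fused

k = fuse_repconv([[0.0] * 3 for _ in range(3)], 2.0)
print(k[1][1])  # 2.0
```

The fusion is exact because convolution is linear: applying the two branches and summing their outputs equals applying the summed kernels once.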

Comparison of SOTA

[Table-1] Comparison of SOTA
[Figure-3] Comparison of RT-SOTA

The proposed YOLOv9 improves significantly in all aspects compared with existing methods, even including detectors that rely on pretrained models.

Generalized ELAN

They compare various computational blocks within GELAN. Almost all blocks show good performance; the best option is GELAN with CSP blocks.

[Table-2] Experiments of GELAN

We can also see that GELAN is not sensitive to depth, so users can arbitrarily combine the components in GELAN to design a network architecture.

Programmable Gradient Information

This part shows the effects of the auxiliary reversible branch and multi-level auxiliary information on the backbone and neck.

[Table-3] Experiments of PGI

The concept of PGI brings two valuable contributions. The first is making the auxiliary supervision method applicable to shallow models. The second is making the deep-model training process obtain more reliable gradients.

YOLOv9-E

We can check the effects of the various methods using YOLOv9-E.

[Table-4] Experiments with various methods

Visualization

This part explores the information bottleneck issue and visualizes it, and also visualizes how PGI uses reliable gradients to find correct correlations between data and targets.

[Figure-4] Feature maps from various networks

We can see that the proposed GELAN still retains fairly complete information, and that PGI provides more reliable gradients during the training process.

 

Conclusion

The authors propose PGI to solve both the information bottleneck problem and the fact that the deep supervision mechanism is not suitable for lightweight networks. They also design GELAN as a highly efficient and lightweight neural network. As a result, YOLOv9 achieves state-of-the-art performance among real-time object detectors.

 

Reference

[Figure-1~4, Table-1~4]: https://arxiv.org/pdf/2402.13616.pdf

[Figure-5]: Handmade

[Seminar ppt]: YOLOv9.pptx (4.43MB)

Issue

When I read the GitHub code and the model.yaml file, I found a somewhat strange architecture, so I think we need to cross-check it.

[Figure-5] Architecture of YOLOv9

The left figure shows the yolov9-c architecture and the center figure shows the yolov9-e architecture as I reconstructed them from the GitHub code. However, yolov9-e differs from the architecture described in the paper, so I believe the right figure shows the correct architecture for yolov9-e. I have not run experiments on this architecture yet, but we need to discuss this part.

[Issue]: https://github.com/WongKinYiu/yolov9/issues/192