Review(44)
-
Paper review: RTMDet (arxiv 2022)
RTMDet: An Empirical Study of Designing Real-Time Object Detectors. Motivation: The authors aim to design a real-time detector, named RTMDet, that exceeds the YOLO series and can also be extended to instance segmentation and rotated object detection. To this end, the paper proposes large-kernel depth-wise convolutions, optimization with soft labels in dynamic label assignment, balancing model capacity between the backbone and neck, adding kernel and mas..
2026.05.14 -
Paper review: YOLOv12 (arxiv 2025, technical report)
YOLOv12: Attention-Centric Real-Time Object Detectors. Motivation: The YOLO framework has focused on CNN-based improvements despite the proven superiority of attention mechanisms. Because attention-based models cannot match the speed of CNN-based models, this paper proposes an attention-centric YOLO framework, namely YOLOv12, using Area Attention (A2) and Residual Efficient Layer Aggregation Networks (RE..
2025.03.06 -
Paper review: Generalized Focal Loss (IEEE Transactions 2023)
Generalized Focal Loss: Towards Efficient Representation Learning for Dense Object Detection. This paper consolidates the earlier v1 (NeurIPS 2020) and v2 (CVPR 2021) papers. Motivation: In object detection, classification is usually optimized with Focal Loss, and box location is commonly learned under a Dirac delta distribution. However, three problems are found in existing practices. 1) t..
2025.02.28 -
Paper review: Rewrite the Stars (CVPR 2024)
Rewrite the Stars. Motivation: Since AlexNet, a myriad of deep networks have emerged, each building on the other. Despite their characteristic insights and contributions, this line of models is mostly based on blocks that blend linear projection with non-linear activations. Then came self-attention, whose most distinctive feature is mapping features to different spaces and then con..
2025.01.13 -
Paper review: UniTR (CVPR 2023)
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation. Motivation: Previous works process multi-modal data with modality-specific encoders in a sequential manner, then combine the features through late fusion. This slows down inference and limits their real-world applications. To tackle these problems, the authors propose to process intra-modal representation learn..
2024.11.11 -
Paper review: DSVT (CVPR 2023)
DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets. Motivation: Designing a 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with sparse convolution, the attention mechanism in Transformers is more appropriate and is easier to deploy in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a stand..
2024.10.18