2022. 11. 27. 02:13ㆍReview/- 2D Object Detection
Motivation
The authors made a number of small improvements to YOLO. Nothing radically new, just a collection of incremental changes that make it better. They also want to share which ideas worked and which did not.
Main Idea
Bounding Box Prediction
YOLO_v2 predicts bounding boxes using dimension clusters as anchor boxes. During training, YOLO_v2 uses a sum of squared error loss.

YOLO_v3 converts each ground-truth box into the corresponding $t_{*}$ targets for coordinate prediction. YOLO_v3 predicts an objectness score for each bounding box using logistic regression, and assigns each ground-truth box to the prior with the best IoU overlap; priors whose IoU exceeds a threshold but are not the best match are ignored. If a bounding box prior is not assigned to a ground-truth object, it incurs no loss for coordinate or class predictions, only for objectness.
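The coordinate decoding can be sketched as follows, using the paper's equations $b_x = \sigma(t_x) + c_x$, $b_y = \sigma(t_y) + c_y$, $b_w = p_w e^{t_w}$, $b_h = p_h e^{t_h}$. The function and variable names are my own; $(c_x, c_y)$ is the grid-cell offset and $(p_w, p_h)$ is the prior size.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs t_* into a box center and size."""
    bx = sigmoid(tx) + cx        # center x: sigmoid keeps it inside the cell
    by = sigmoid(ty) + cy        # center y
    bw = pw * math.exp(tw)       # width: exponential scaling of the prior
    bh = ph * math.exp(th)       # height
    return bx, by, bw, bh

# Zero outputs land the box at the cell center with exactly the prior's size:
# decode_box(0, 0, 0, 0, cx=3, cy=4, pw=2, ph=5) -> (3.5, 4.5, 2.0, 5.0)
```

The sigmoid on $t_x, t_y$ is what makes each cell responsible only for boxes whose centers fall inside it.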
Class Prediction
YOLO_v2 uses a softmax, but YOLO_v3 does not, because a softmax is unnecessary for good performance.
In other words, a softmax is unsuitable when the model must predict multiple labels for one box, which happens when many overlapping labels (e.g. "woman" and "person") apply to the same object. So YOLO_v3 uses independent logistic classifiers trained with binary cross-entropy loss.
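A minimal sketch of the multi-label setup: each class gets an independent sigmoid plus BCE, so two overlapping labels can both be active, which a softmax (probabilities summing to 1) cannot express. The class names and logit values here are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, y):
    """Binary cross-entropy for one class probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Logits for 3 classes, say ["person", "woman", "car"]. A single box can be
# both "person" and "woman", so each class is scored independently.
logits = [4.0, 3.0, -5.0]
probs = [sigmoid(z) for z in logits]
labels = [1, 1, 0]                                   # multi-label target
loss = sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)

# Both overlapping labels can exceed 0.5 at the same time.
assert probs[0] > 0.5 and probs[1] > 0.5
```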
Predictions Across Scales
YOLO_v3 predicts boxes at 3 different scales, taken from the last three stages of the network. YOLO_v3 predicts 3 boxes per cell at each scale, so the output tensor is NxNx[anchors(3)*(4+1+classes)]. In this step YOLO_v3 uses 1x1 and 3x3 conv layers.

This method lets the model combine meaningful semantic information from deeper layers with fine-grained information from earlier layers.
YOLO_v3 still uses k-means clustering to determine the bounding box priors.
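A quick shape check of the three detection heads under the paper's COCO setting (80 classes, 3 anchors per scale). The 416x416 input and strides 32/16/8 are the common configuration, assumed here rather than stated in this post.

```python
def head_shape(input_size, stride, num_anchors=3, num_classes=80):
    """Output tensor shape: N x N x [anchors * (4 box + 1 obj + classes)]."""
    n = input_size // stride
    return (n, n, num_anchors * (4 + 1 + num_classes))

# Three detection scales for a 416x416 input (strides 32, 16, 8).
shapes = [head_shape(416, s) for s in (32, 16, 8)]
# -> [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
```

The coarse 13x13 grid handles large objects while the fine 52x52 grid handles small ones.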
Feature Extractor
YOLO_v3 uses Darknet-53 as its backbone network. Darknet-53 consists of successive 3x3 and 1x1 convolutional layers with shortcut connections; its residual blocks use addition, not concatenation.
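The addition-vs-concatenation distinction can be illustrated with channel counts. This is a deliberately minimal sketch with plain lists standing in for per-channel feature maps, not the actual Darknet-53 code.

```python
def add_features(x, fx):
    """Residual shortcut (Darknet-53 style): element-wise addition,
    so the channel count is unchanged."""
    assert len(x) == len(fx), "shortcut requires matching shapes"
    return [a + b for a, b in zip(x, fx)]

def concat_features(x, fx):
    """Concatenation (DenseNet style): channels stack up instead."""
    return x + fx

x = [1.0, 2.0, 3.0]            # a "feature map" with 3 channels
fx = [0.5, -1.0, 0.25]         # residual branch output, same shape

shortcut = add_features(x, fx)    # still 3 channels
stacked = concat_features(x, fx)  # 6 channels: memory grows with depth
```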


Training
YOLO_v3 uses multi-scale training, lots of data augmentation, batch normalization, and all the standard stuff.

YOLO_v3 extracts 3 kinds of feature maps, each corresponding to a different object scale. This is similar to FPN.
Loss Function
The loss function is almost the same as YOLO_v2's, apart from the $t_{*}$ coordinate targets and the BCE terms:
- MSE of bounding box offset
- BCE of objectness score
- BCE of no objectness score
- multi-label BCE of class predictions per box
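The terms above can be sketched for a single predicted box. The balancing weights and the positive/negative masks of the full loss are omitted, and all values are scalars for brevity; this is my own simplification, not the exact implementation.

```python
import math

def bce(p, y):
    """Binary cross-entropy for probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def yolo_v3_box_loss(t_pred, t_true, obj_pred, obj_true, cls_pred, cls_true):
    """Per-box loss: MSE on t_* offsets, BCE on objectness
    (covers both object and no-object boxes via obj_true), BCE per class."""
    coord = sum((p - t) ** 2 for p, t in zip(t_pred, t_true))     # MSE of offsets
    objness = bce(obj_pred, obj_true)                             # objectness BCE
    classes = sum(bce(p, y) for p, y in zip(cls_pred, cls_true))  # multi-label BCE
    return coord + objness + classes

loss = yolo_v3_box_loss(
    t_pred=[0.1, 0.2, 0.0, 0.0], t_true=[0.1, 0.2, 0.0, 0.0],  # perfect offsets
    obj_pred=0.9, obj_true=1,                                   # confident object
    cls_pred=[0.95, 0.05], cls_true=[1, 0],                     # one active class
)
```

For an unassigned prior, only the objectness term (with `obj_true=0`) would be kept, matching the assignment rule described earlier.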
Inference

The mAP of YOLO_v3 is lower than RetinaNet's, but YOLO_v3's inference is about three times faster.
Addition
The authors emphasize two things. First, mAP is not an accurate evaluation metric. Second, computer vision research is used by the military and for other harmful purposes, so researchers bear at least some responsibility for how it is used.
Reference
[IMG-1, 3, 5]: https://arxiv.org/pdf/1804.02767.pdf
[IMG-2, 4]: https://towardsdatascience.com/dive-really-deep-into-yolo-v3-a-beginners-guide-9e3d2666280e
[Implementation]: https://github.com/kongbuhaja/YOLO_v3