Paper review: YOLOv1(CVPR 2016)

2022. 10. 1. 21:56

Motivation

R-CNN-style object detectors (such as Faster R-CNN) use at least two steps:

Step 1: generate potential bounding boxes

Step 2: run a classifier on the proposed boxes

These complex pipelines are slow and hard to optimize.

Each component of the pipeline has to be trained separately.

Main Idea

A single network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

[IMG-1] YOLO System

1. Unified Detection

YOLO uses features from the entire image and predicts bounding boxes and classes simultaneously.

[IMG-2] Predict bounding box

1. Divide the input image into an S x S grid.

2. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object.

3. Each grid cell predicts B bounding boxes and a confidence score for each box.

    - The confidence score reflects how confident the model is that the box contains an object.

4. Each bounding box consists of 5 predictions: x, y, w, h, confidence.

5. Each grid cell also predicts C conditional class probabilities (= Pr(Class|Object)).

    - The predictions form an S x S x (B * 5 + C) tensor.

6. Select the final bounding boxes using NMS.

S: number of grid cells along each side (width and height)
B: number of bounding boxes in each grid cell
Pr(): probability (0~1)
IoU: Intersection over Union (more: https://find-knowledge.tistory.com/2)
Confidence: Pr(Object) * IoU
NMS: non-max suppression (more: https://find-knowledge.tistory.com/2)
(x, y): center coordinates of the bounding box (relative to the grid cell)
(w, h): width and height of the bounding box (relative to the whole image)

In the paper: S=7, B=2, C=20 (dataset: Pascal VOC)
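To make the grid encoding concrete, here is a minimal sketch of how one ground-truth box could be written into the S x S x (B * 5 + C) target. The function name `encode_box`, the pixel-based inputs, and the layout along the last axis (B box slots first, then C class entries) are assumptions for illustration, not something fixed by the paper.

```python
import numpy as np

S, B, C = 7, 2, 20  # values used in the paper for Pascal VOC


def encode_box(cx, cy, w, h, class_id, img_w, img_h):
    """Build an S x S x (B*5 + C) target holding one ground-truth box.

    cx, cy, w, h are in pixels. The layout [box slots..., classes] along
    the last axis is an assumption made for this sketch.
    """
    target = np.zeros((S, S, B * 5 + C), dtype=np.float32)
    # Grid cell that contains the box center: this cell is "responsible".
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    # (x, y): offset of the center inside the cell, in [0, 1).
    x = cx / img_w * S - col
    y = cy / img_h * S - row
    # (w, h): size relative to the whole image; confidence target set to 1.
    target[row, col, 0:5] = [x, y, w / img_w, h / img_h, 1.0]
    # One-hot conditional class probability for this cell.
    target[row, col, B * 5 + class_id] = 1.0
    return target, (row, col)
```

For a 448x448 image, a box centered at (224, 224) lands in cell (3, 3) with offsets x = y = 0.5.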

2. Network Design

[IMG-3] YOLO Architecture

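The network has 24 convolutional layers followed by 2 fully connected layers. (The fast version uses 9 convolutional layers.)
The network uses 1 x 1 convolutional layers followed by 3 x 3 convolutional layers, an idea inspired by the Inception modules of GoogLeNet.

The final output of the network is a 7 x 7 x 30 tensor (S x S x (B * 5 + C) with S=7, B=2, C=20).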

2-1. Pretraining

Pretraining uses the first 20 convolutional layers, followed by an average-pooling layer and a fully connected layer.

These layers are pretrained on the ImageNet dataset for classification.

2-2. Training

After pretraining, remove the last 2 layers and add 4 convolutional layers and 2 fully connected layers with randomly initialized weights.

Increase the input resolution of the network from 224x224 to 448x448.

The final layer predicts class probabilities and bounding box coordinates.

The width and height of each bounding box are normalized by the image width and height. (0~1)

The bounding box x and y coordinates are parametrized as offsets from a particular grid cell location. (0~1)
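As a small illustration of this parametrization, the sketch below converts one prediction back to pixel corner coordinates. The function name `decode_box` and the (x1, y1, x2, y2) output format are hypothetical choices for this example.

```python
def decode_box(x, y, w, h, row, col, img_w, img_h, S=7):
    """Convert a cell-relative (x, y) and image-relative (w, h) to pixels.

    Returns the box as (x1, y1, x2, y2) corner coordinates.
    """
    # Center: cell index plus offset inside the cell, scaled to the image.
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    # Size: fraction of the whole image.
    bw, bh = w * img_w, h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
```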

YOLO uses a linear activation function for the final layer; all other layers use leaky ReLU.

[IMG-4] Leaky ReLU
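For reference, a minimal NumPy sketch of the leaky ReLU. The paper uses a slope of 0.1 for negative inputs; the function name and signature here are only for illustration.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """Return x where x > 0, and slope * x otherwise."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x > 0, x, slope * x)
```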

2-3. Loss

The YOLO loss is based on SSE (sum-squared error).

But SSE does not align perfectly with the goal of maximizing average precision, for two reasons.

1. It weights localization error equally with classification error, which may not be ideal.

2. Many grid cells do not contain any object, which pushes the confidence scores of those cells toward zero.

    Since most grid cells are trained toward confidence score = 0, their gradient can overpower the gradient from cells that do contain objects, which makes training unstable.

 

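To remedy this, YOLO increases the loss from bounding box coordinate predictions and decreases the loss from confidence predictions for boxes that do not contain objects.

YOLO sets the weights $λ_{coord}$=5 and $λ_{noobj}$=0.5.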

 

SSE has another problem: it weights errors in large boxes and small boxes equally.

The error metric should reflect that a small deviation matters more in a small box than in a large box.

To partially address this, YOLO predicts the square root of the bounding box width and height instead of the width and height directly.
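A quick numeric check of why the square root helps. The box sizes below are made-up values (relative to the image), chosen so that both boxes are off by the same absolute amount of 0.05.

```python
import math


def size_error(true_size, pred_size, use_sqrt):
    """Squared error on a box side, with or without the square root."""
    if use_sqrt:
        return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2
    return (true_size - pred_size) ** 2


# Same absolute deviation (0.05) in a small box and in a large box.
small_plain = size_error(0.10, 0.15, use_sqrt=False)
large_plain = size_error(0.80, 0.85, use_sqrt=False)
small_sqrt = size_error(0.10, 0.15, use_sqrt=True)   # about 0.0051
large_sqrt = size_error(0.80, 0.85, use_sqrt=True)   # about 0.00076
```

Without the square root both errors are the same (0.0025). With it, the small box is penalized several times more than the large box.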

 

YOLO predicts multiple bounding boxes per grid cell. At training time, YOLO wants only one bounding box predictor to be responsible for each object. YOLO assigns the object to the predictor whose prediction has the highest IoU with the ground truth.

[IMG-5] Multi-part loss function
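For readers who cannot see the image, the loss from the paper can be written out as follows. The five lines appear in the same order as terms ① to ⑤ in this post. Hatted symbols are the ground-truth targets, and $1_{ij}^{noobj}$ is the complement of $1_{ij}^{obj}$.

$$
\begin{aligned}
& λ_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\; & λ_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
+\; & \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
+\; & λ_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
+\; & \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$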

$1_{i}^{obj}$: indicator of whether an object exists in the ith cell (1 or 0)
$1_{ij}^{obj}$: indicator of whether the jth bounding box in the ith cell is responsible for the object (1 or 0)
$λ_{coord}$: constant that balances the coordinate loss against the classification loss (=5)
$λ_{noobj}$: constant that balances boxes with an object against boxes without an object (=0.5)

① coordinate loss of the jth bounding box in the ith cell (object exists)

② size loss of the jth bounding box in the ith cell (object exists)

③ confidence score loss (object exists, Ci=1)

④ confidence score loss (object does not exist, Ci=0)

⑤ conditional class probability loss (object exists; correct class c: Pi(c)=1, otherwise: Pi(c)=0)
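The five terms can be sketched in NumPy for a single image as below. This is only an illustration under several assumptions: the array layouts, the function name `yolo_loss`, and passing the responsible-predictor mask `resp` in as an input (instead of computing it from IoU) are all choices made for this example. The confidence target for responsible boxes is 1, as in the description above, while the paper's text describes the target as the IoU with the ground truth.

```python
import numpy as np


def yolo_loss(pred_box, pred_cls, true_box, true_cls, resp,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLOv1-style loss for one image (illustrative sketch).

    pred_box: (S, S, B, 5) predicted x, y, w, h, confidence
    pred_cls: (S, S, C)    predicted conditional class probabilities
    true_box: (S, S, 4)    ground-truth x, y, w, h (zeros in empty cells)
    true_cls: (S, S, C)    one-hot class targets (zeros in empty cells)
    resp:     (S, S, B)    1 where predictor j of cell i is responsible
    """
    obj_cell = resp.max(axis=-1)       # 1_i^obj, shape (S, S)
    noobj = 1.0 - resp                 # 1_ij^noobj
    tb = true_box[:, :, None, :]       # add a B axis for broadcasting

    xy = ((pred_box[..., 0:2] - tb[..., 0:2]) ** 2).sum(-1)
    # Clip before sqrt: a linear output layer can produce negative sizes.
    wh = ((np.sqrt(np.clip(pred_box[..., 2:4], 0, None))
           - np.sqrt(tb[..., 2:4])) ** 2).sum(-1)
    conf = pred_box[..., 4]

    loss = lambda_coord * (resp * xy).sum()            # term 1: coordinates
    loss += lambda_coord * (resp * wh).sum()           # term 2: size
    loss += (resp * (conf - 1.0) ** 2).sum()           # term 3: target 1
    loss += lambda_noobj * (noobj * conf ** 2).sum()   # term 4: target 0
    cls_err = ((pred_cls - true_cls) ** 2).sum(-1)
    loss += (obj_cell * cls_err).sum()                 # term 5: classes
    return float(loss)
```

With a perfect prediction the loss is 0. A confidence of 1 in an empty cell adds only 0.5 because of $λ_{noobj}$, while a coordinate error of 0.1 in a responsible box adds 5 * 0.01 = 0.05.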

 

2-4. Hyper parameters

batch_size: 64
momentum: 0.9 (weight decay of 5e-4)
epochs: 135
dropout ratio: 0.5
learning_rate: raised slowly from 1e-3 to 1e-2 over the first epochs, then
    1e-2 (for 75 epochs)
    1e-3 (for 30 epochs)
    1e-4 (for last 30 epochs)
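The learning-rate schedule could be sketched as a simple function of the epoch. The warm-up length is not stated in the schedule above, so `warmup_epochs=5` is a hypothetical value, and counting the warm-up as part of the 75-epoch phase (and ramping linearly) is also an assumption.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-based epoch index (135 epochs in total).

    warmup_epochs is a hypothetical value; the warm-up is assumed to be
    linear and to count as part of the 75-epoch phase at 1e-2.
    """
    if epoch < warmup_epochs:
        # Ramp from 1e-3 toward 1e-2.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2
    if epoch < 105:   # next 30 epochs
        return 1e-3
    return 1e-4       # last 30 epochs
```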

Data augmentation: random scaling and translations of up to 20% of the original image size

Activation function: linear activation for the final layer; all other layers use leaky ReLU

 

2-5. Inference

Just like in training, predicting detections requires one network evaluation.

On Pascal VOC, the network predicts 98 (7*7*2) bounding boxes per image and class probabilities for each box.

Because only a single network evaluation is needed, YOLO is extremely fast at test time.
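At test time, the paper multiplies the conditional class probabilities by each box confidence to get class-specific scores, Pr(Class|Object) * Pr(Object) * IoU. A minimal sketch of that step follows. The array shapes and the function name `class_scores` are assumptions for this example.

```python
import numpy as np


def class_scores(conf, cls_prob):
    """Class-specific confidence scores for every predicted box.

    conf:     (S, S, B) box confidences, Pr(Object) * IoU
    cls_prob: (S, S, C) conditional class probabilities, Pr(Class|Object)
    Returns an (S*S*B, C) array: one row of class scores per box.
    """
    scores = conf[..., :, None] * cls_prob[..., None, :]  # (S, S, B, C)
    return scores.reshape(-1, cls_prob.shape[-1])
```

With S=7, B=2, C=20 the result has 98 rows, one per predicted box.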

Some large objects, or objects near the border of multiple cells, can be well localized by multiple cells.

NMS can be used to fix these multiple detections (it adds 2~3% mAP in YOLO).
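A minimal greedy NMS sketch in plain Python, with boxes given as (x1, y1, x2, y2). The IoU threshold of 0.5 is an assumed value, not one taken from the paper, and in practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping boxes and one distant box reduce to the higher-scoring of the overlapping pair plus the distant box.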

 

3. Limitations of YOLO

1. YOLO imposes strong spatial constraints, since each grid cell predicts only two boxes and one class.

    This spatial constraint limits the number of nearby objects YOLO can predict, so it struggles with small objects that appear in groups.

2. YOLO learns to predict bounding boxes from data.

    It struggles to generalize to objects in new or unusual aspect ratios or configurations.

3. The loss function treats errors the same in small bounding boxes and large bounding boxes.

    A small error in a small box has a much greater effect on IoU than the same error in a large box.

Experiments

[IMG-6] Real-Time System on Pascal VOC 2007

[IMG-7] Error Analysis: Fast R-CNN vs. YOLO

Correct: correct class and IoU > 0.5
Localization: correct class, 0.1 < IoU < 0.5
Similar: class is similar, IoU > 0.1
Other: class is wrong, IoU > 0.1
Background: IoU < 0.1 for any object
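The error categories can be read as a simple decision rule. The sketch below uses a hypothetical function `error_type`. How the boundary case IoU = 0.5 is handled is an assumption, since the definitions do not cover it.

```python
def error_type(iou, class_correct, class_similar):
    """Assign a detection to one of the error-analysis categories."""
    if class_correct and iou > 0.5:
        return "Correct"
    if class_correct and iou > 0.1:
        return "Localization"
    if class_similar and iou > 0.1:
        return "Similar"
    if iou > 0.1:
        return "Other"
    return "Background"
```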

 

[IMG-7] Picasso dataset precision-recall curves

[IMG-8] Quantitative results on each dataset

 

Reference

[IMG-ALL]: https://arxiv.org/pdf/1506.02640.pdf

[STUDY]: https://docs.google.com/presentation/d/1aeRvtKG21KHdD5lg6Hgyhx5rPq_ZOsGjG5rJ1HP7BbA/pub?start=false&loop=false&delayms=3000&slide=id.p, https://www.youtube.com/c/Deeplearningai

[Implementation]: https://github.com/kongbuhaja/YOLO_v1
