Paper review: SSD(ECCV 2016)

2022. 10. 9. 22:06

Motivation

R-CNN-style object detectors work in two steps

Step-1: region proposal

Step-2: classification of each proposal

These detectors are slow

YOLO is fast but has relatively low accuracy

SSD achieves both high speed and high accuracy for real-time object detection

Main Idea

SSD is based on a feed-forward convolutional network that produces bounding boxes and scores for the presence of objects in those boxes, followed by an NMS step to produce the final detections. SSD consists of a base network (VGG16 in the paper) and an auxiliary structure.

[IMG-1] architecture of SSD and YOLO

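| Feature map | Size | Conv: 3x3x(k(classes + offset)) | Prior scale | Aspect ratios (-> k) | Total number of default boxes |
|---|---|---|---|---|---|
| conv4_3 | 38x38x512 | 3x3x(4(c+4)) | 0.1 | 1:1, 2:1, 1:2 + α | 5776 |
| conv7 | 19x19x1024 | 3x3x(6(c+4)) | 0.2 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 2166 |
| conv8_2 | 10x10x512 | 3x3x(6(c+4)) | 0.375 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 600 |
| conv9_2 | 5x5x256 | 3x3x(6(c+4)) | 0.55 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 150 |
| conv10_2 | 3x3x256 | 3x3x(4(c+4)) | 0.725 | 1:1, 2:1, 1:2 + α | 36 |
| conv11_2 | 1x1x256 | 3x3x(4(c+4)) | 0.9 | 1:1, 2:1, 1:2 + α | 4 |
| Total | | | | | 8732 |

k: number of default boxes per feature map cell, c: number of classes, "+ α": one extra 1:1 default box with scale $s'_{k}$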
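The total of 8732 can be checked with a few lines of Python. This is only a minimal sketch: the dictionary and function names are made up for illustration, and the feature map sizes and k values are copied from the table (the 19x19 map is keyed as "conv7" here).

```python
# Feature map side length and number of default boxes per cell (k),
# copied from the table above. The names are illustrative only.
FEATURE_MAPS = {
    "conv4_3": (38, 4),
    "conv7": (19, 6),
    "conv8_2": (10, 6),
    "conv9_2": (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}


def count_default_boxes(feature_maps):
    """Return per-layer and total default box counts (size * size * k)."""
    per_layer = {name: size * size * k for name, (size, k) in feature_maps.items()}
    return per_layer, sum(per_layer.values())


per_layer, total = count_default_boxes(FEATURE_MAPS)
print(per_layer["conv4_3"])  # 5776
print(total)                 # 8732
```

About two thirds of all default boxes come from conv4_3 alone, because it is the highest-resolution map.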

1. Multi-scale feature maps

SSD uses multi-scale feature maps, while YOLO uses only one. Predictions are made from 6 feature maps of different resolutions, which makes it possible to detect objects of various sizes. Replacing the FC layers with Conv layers also improves detection speed.

2 feature maps come from the base network (conv4_3: 38x38, conv7: 19x19),

4 feature maps come from the extra layers (conv8_2: 10x10, conv9_2: 5x5, conv10_2: 3x3, conv11_2: 1x1)

 

2. Default boxes and aspect ratios

SSD places default bounding boxes at each cell of a feature map using different scales and aspect ratios. For each default box, SSD predicts the offsets relative to the default box and the per-class scores that indicate the presence of each class.

The scale of conv4_3 is 0.1, while the scales of the remaining feature maps increase linearly from 0.2 to 0.9

[IMG-2] scale of default boxes

$s_{k}$ : scale of the default box relative to the image size, $a_{r}$ : aspect ratio of the default box

| Symbol | Value |
|---|---|
| m | 5 (number of feature maps - 1) |
| $s_{min}$ | 0.2 |
| $s_{max}$ | 0.9 |
| $a_{r}$ | {1, 2, 3, 1/2, 1/3} |
| $w^{a}_{k}$ | $s_{k}\sqrt{a_{r}}$ |
| $h^{a}_{k}$ | $s_{k}/\sqrt{a_{r}}$ |
| $s'_{k}$ ($a_{r}$=1) | $\sqrt{s_{k}s_{k+1}}$ |

The smaller the feature map, the larger the objects it can detect.
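The scale and size definitions above can be turned into a short calculation. The sketch below assumes the linear rule from the SSD paper, $s_{k} = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$ for k = 1..m, with m = 5 covering the five maps after conv4_3. The function names are hypothetical.

```python
import math


def default_box_scales(s_min=0.2, s_max=0.9, m=5):
    """Linearly spaced scales s_1..s_m between s_min and s_max."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_box_sizes(s_k, s_next, aspect_ratios):
    """Return (width, height) pairs relative to the image size for one feature map."""
    sizes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(s_k * s_next)  # extra box for aspect ratio 1
    sizes.append((extra, extra))
    return sizes


scales = default_box_scales()
print([round(s, 3) for s in scales])  # [0.2, 0.375, 0.55, 0.725, 0.9]

# conv9_2 has scale 0.55 and the next map (conv10_2) has scale 0.725.
sizes = default_box_sizes(scales[2], scales[3], [1, 2, 3, 1 / 2, 1 / 3])
print(len(sizes))  # 6 default boxes per cell
```

Every box for a given aspect ratio has area $s_{k}^{2}$, since $w \cdot h = s_{k}\sqrt{a_{r}} \cdot s_{k}/\sqrt{a_{r}}$; only the shape changes.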

 

3. Predictions

Each feature map uses a different k, and k is determined by its set of aspect ratios.

SSD produces outputs from the 6 feature maps using Conv(3x3x(kx(c+offset))). Each output has (c+offset)kmn values for an mxn feature map.

For example, take conv9_2: it has a 5x5 feature map and 6 kinds of default boxes (k=6), with scale $s_{k}$=0.55, extra-box scale $s'_{k}=\sqrt{0.55 \times 0.725} \approx 0.631$, and c=21 (20 classes, 1 background). So there are 5x5x6=150 default boxes, and the output size is 5x5x6x(21+4)=3750.
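The arithmetic for the conv9_2 example can be sketched as follows. The helper name is hypothetical; it only counts default boxes and predicted values, and does not build an actual convolution layer.

```python
def prediction_head_size(m, n, k, num_classes, num_offsets=4):
    """Return (number of default boxes, number of predicted values) for an m x n map."""
    num_boxes = m * n * k
    return num_boxes, num_boxes * (num_classes + num_offsets)


# conv9_2: 5x5 map, k=6, c=21 (20 classes + background)
boxes, values = prediction_head_size(5, 5, 6, 21)
print(boxes, values)  # 150 3750
```

So conv9_2 contributes 150 default boxes, each with 21 class scores and 4 offsets.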

 

4. Matching strategy

During training, we need to determine which default boxes correspond to each ground truth box. We begin by matching each ground truth box to the default box with the best jaccard overlap (=IoU). Then we also match default boxes to any ground truth whose jaccard overlap is higher than a threshold (=0.5). These matched default boxes are labeled positive. Default boxes whose jaccard overlap is lower than the threshold are labeled negative.
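A minimal pure-Python sketch of this matching is shown below. It assumes boxes are given as (xmin, ymin, xmax, ymax) tuples in normalized coordinates and that both lists are non-empty. The function names and the -1 marker for negatives are choices made for this example, not something taken from the paper.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the matched ground truth index, or -1 for a negative."""
    matches = [-1] * len(default_boxes)
    # Any default box whose best overlap exceeds the threshold becomes a positive.
    for i, d in enumerate(default_boxes):
        overlaps = [jaccard(d, g) for g in gt_boxes]
        best_j = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[best_j] > threshold:
            matches[i] = best_j
    # Each ground truth always keeps its best default box, even below the threshold.
    for j, g in enumerate(gt_boxes):
        best_i = max(range(len(default_boxes)), key=lambda i: jaccard(default_boxes[i], g))
        matches[best_i] = j
    return matches


defaults = [(0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6), (0.5, 0.5, 1.0, 1.0), (0.0, 0.0, 0.2, 0.2)]
gts = [(0.0, 0.0, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]
print(match(defaults, gts))  # [0, -1, 1, -1]
```

In this toy example the second ground truth overlaps its best default box by only 0.36, but it is still matched, so every ground truth gets at least one positive.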

 

After the matching step, most of the default boxes are negatives. This introduces a significant imbalance between the positive and negative training examples. So instead of using all the negatives, we sort them by the confidence loss of each default box, from highest to lowest, and pick the top ones so that the ratio between negatives and positives is at most 3:1. This leads to faster optimization and more stable training.
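This selection of negatives (hard negative mining) can be sketched in a few lines. The function and argument names are hypothetical, and the loss values in the example are made up.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negatives kept for the confidence loss."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: highest confidence loss.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * num_pos])


# One positive (index 0) and seven negatives with made-up confidence losses.
losses = [0.9, 0.1, 2.3, 0.4, 1.7, 0.05, 0.8, 1.2]
positive = [True, False, False, False, False, False, False, False]
print(hard_negative_mining(losses, positive))  # [2, 4, 7]
```

With one positive, only the three negatives with the highest loss are kept, and the other four are ignored in the confidence loss.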

 

5. Loss function

 

[IMG-3] Total loss
[IMG-4] Localization loss
[IMG-5] Confidence loss

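| Symbol | Meaning | Value |
|---|---|---|
| $x^{p}_{ij}$ | indicator for matching the i-th default box to the j-th ground truth box of category p | {1, 0} |
| N | number of matched default boxes | |
| α | balancing parameter between the two losses | 1 |
| l | predicted box | |
| g | ground truth box | |
| d | default bounding box | |
| cx, cy | center of the box | |
| w, h | width, height | |
| $L_{loc}$ | Smooth L1 loss | |
| $L_{conf}$ | softmax loss | |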
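As a sketch of how the three losses are written in the SSD paper, using the symbols above: here $\hat{g}$ is the ground truth box encoded relative to the default box d, $\hat{c}^{p}_{i}$ is the softmax over the class scores c of the i-th default box, and Pos / Neg are the sets of positive and negative default boxes. The exact notation should be checked against the paper.

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x^{p}_{ij}\, \mathrm{smooth}_{L1}\left(l^{m}_{i} - \hat{g}^{m}_{j}\right)$$

$$\hat{g}^{cx}_{j} = \frac{g^{cx}_{j} - d^{cx}_{i}}{d^{w}_{i}}, \quad \hat{g}^{cy}_{j} = \frac{g^{cy}_{j} - d^{cy}_{i}}{d^{h}_{i}}, \quad \hat{g}^{w}_{j} = \log\frac{g^{w}_{j}}{d^{w}_{i}}, \quad \hat{g}^{h}_{j} = \log\frac{g^{h}_{j}}{d^{h}_{i}}$$

$$L_{conf}(x, c) = -\sum_{i \in Pos} x^{p}_{ij} \log\left(\hat{c}^{p}_{i}\right) - \sum_{i \in Neg} \log\left(\hat{c}^{0}_{i}\right), \quad \hat{c}^{p}_{i} = \frac{\exp(c^{p}_{i})}{\sum_{p} \exp(c^{p}_{i})}$$

If N = 0, the loss is set to 0.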

 

6. Data augmentation

To make the model more robust to various input object sizes and shapes, each training image is randomly sampled and augmented as follows.

- Use the entire original input image, or

- Randomly sample a patch so that the minimum jaccard overlap with the objects is one of [0.1, 0.3, 0.5, 0.7, 0.9]

- The aspect ratio of a sampled patch is between 1/2 and 2

- Horizontally flip with a probability of 0.5

- Apply some photo-metric distortions
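The patch sampling and the flip can be sketched as below. This is a simplified illustration, not the paper's exact procedure. It assumes normalized (xmin, ymin, xmax, ymax) boxes, patch sides between 0.1 and 1 of the image, and that a patch is accepted when at least one object reaches the minimum overlap. All function names are hypothetical.

```python
import random


def iou(a, b):
    """Jaccard overlap of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(rng, boxes, min_iou, max_tries=50):
    """Sample a patch with aspect ratio in [1/2, 2] that overlaps some object enough."""
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if not 0.5 <= w / h <= 2.0:
            continue
        x0 = rng.uniform(0.0, 1.0 - w)
        y0 = rng.uniform(0.0, 1.0 - h)
        patch = (x0, y0, x0 + w, y0 + h)
        if any(iou(patch, b) >= min_iou for b in boxes):
            return patch
    return (0.0, 0.0, 1.0, 1.0)  # fall back to the entire image


def hflip_box(box):
    """Mirror a normalized box horizontally."""
    xmin, ymin, xmax, ymax = box
    return (1.0 - xmax, ymin, 1.0 - xmin, ymax)


rng = random.Random(0)
objects = [(0.2, 0.3, 0.5, 0.8)]
min_iou = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9])
patch = sample_patch(rng, objects, min_iou)
flipped = hflip_box(objects[0]) if rng.random() < 0.5 else objects[0]
print(patch, flipped)
```

A full pipeline would also crop the image, clip the boxes to the patch, resize to the network input size, and apply the photo-metric distortions, which are left out here.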

 

Experiment

[IMG-6] Effect of data augmentation
[IMG-7] Small objects are hard to detect

The experiments show two important things

1. Data augmentation improves performance

2. SSD has difficulty detecting small objects

 

Reference

[IMG]: https://arxiv.org/pdf/1512.02325.pdf

[Study]: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection
