2022. 10. 9. 22:06 · Review / 2D Object Detection
Motivation
R-CNN-style object detectors work in two steps:
Step 1: region proposal
Step 2: classification of each proposal
These detectors are slow.
YOLO is fast but relatively less accurate.
SSD achieves both high speed and high accuracy for real-time object detection.
Main Idea
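SSD is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object classes in those boxes, followed by an NMS (non-maximum suppression) step that produces the final detections. SSD consists of a base network (VGG16 in the paper) and an auxiliary structure.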
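The NMS step mentioned above can be sketched as a greedy loop over boxes sorted by score. This is a minimal illustration, not the paper's implementation: the `(xmin, ymin, xmax, ymax)` box format, the function names, and the default threshold of 0.45 are assumptions made for the example.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep a box only if it does not overlap a kept, higher-scoring box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.05, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In a detector, this would typically be run separately for each class on the boxes whose score passes a confidence threshold.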
Feature map | Size | Conv: 3x3x(k x (classes + offsets)) | Prior scale | Aspect ratios (-> k) | Number of default boxes |
---|---|---|---|---|---|
conv4_3 | 38x38x512 | 3x3x(4 x (c+4)) | 0.1 | 1:1, 2:1, 1:2 + α | 5776 |
conv7 (fc7) | 19x19x1024 | 3x3x(6 x (c+4)) | 0.2 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 2166 |
conv8_2 | 10x10x512 | 3x3x(6 x (c+4)) | 0.375 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 600 |
conv9_2 | 5x5x256 | 3x3x(6 x (c+4)) | 0.55 | 1:1, 2:1, 1:2, 3:1, 1:3 + α | 150 |
conv10_2 | 3x3x256 | 3x3x(4 x (c+4)) | 0.725 | 1:1, 2:1, 1:2 + α | 36 |
conv11_2 | 1x1x256 | 3x3x(4 x (c+4)) | 0.9 | 1:1, 2:1, 1:2 + α | 4 |
Total | | | | | 8732 |
k: number of default boxes per feature-map cell, c: number of classes, α: the extra 1:1 box with scale $s'_{k}$
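The totals in the table can be checked with a few lines of arithmetic. The sketch below only restates the table (feature-map side length and k per layer); the dictionary and function names are illustrative, and the 19x19 layer is keyed here as `conv7`.

```python
# Feature-map side length and default boxes per cell (k), taken from the table.
FEATURE_MAPS = {
    "conv4_3": (38, 4),
    "conv7": (19, 6),
    "conv8_2": (10, 6),
    "conv9_2": (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}


def boxes_per_layer(feature_maps):
    """Return {layer: side * side * k} for square feature maps."""
    return {name: side * side * k for name, (side, k) in feature_maps.items()}


counts = boxes_per_layer(FEATURE_MAPS)
total = sum(counts.values())
print(counts["conv4_3"])  # 5776
print(total)              # 8732
```

Note that conv4_3 alone contributes 5776 of the 8732 boxes, so most default boxes are small ones on the highest-resolution feature map.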
1. Multi-scale feature maps
SSD predicts from multi-scale feature maps, while YOLO predicts from only one. SSD uses 6 feature maps, which makes it possible to detect objects of various sizes. Replacing the FC layers with Conv layers also improves detection speed.
2 feature maps come from the base network (conv4_3: 38x38, conv7: 19x19),
4 feature maps come from the extra layers (conv8_2: 10x10, conv9_2: 5x5, conv10_2: 3x3, conv11_2: 1x1)
2. Default boxes and aspect ratios
SSD places default bounding boxes on each cell of every feature map, using different scales and aspect ratios. For each default box, SSD predicts the offsets relative to that box and the per-class scores that indicate the presence of each class.
The scale of conv4_3 is 0.1, while the scales of the remaining feature maps increase linearly from 0.2 to 0.9.
$s_{k}$: scale of the default box relative to the image size, $a_{r}$: aspect ratio of the default box
Symbol | Value |
---|---|
m | 5 (number of feature maps - 1) |
$s_{min}$ | 0.2 |
$s_{max}$ | 0.9 |
$a_{r}$ | $\in \{1, 2, 3, 1/2, 1/3\}$ |
$w^{a}_{k}$ | $s_{k}\sqrt{a_{r}}$ |
$h^{a}_{k}$ | $s_{k}/\sqrt{a_{r}}$ |
$s'_{k}$ ($a_{r}=1$) | $\sqrt{s_{k}s_{k+1}}$ |
The smaller the feature map, the larger the objects it can detect.
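A short sketch of how the scales and box shapes above could be computed. It assumes the linear rule $s_{k} = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1)$ for $k = 1, \dots, m$, with $m = 5$ covering the five feature maps after conv4_3. The function names are illustrative, and widths and heights are expressed as fractions of the image side.

```python
import math


def linear_scales(s_min=0.2, s_max=0.9, m=5):
    """Scales s_1..s_m spaced linearly between s_min and s_max."""
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * k for k in range(m)]


def box_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs for one feature map."""
    shapes = [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]
    # Extra square box (a_r = 1) with scale sqrt(s_k * s_{k+1}).
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


scales = linear_scales()
print([round(s, 3) for s in scales])  # [0.2, 0.375, 0.55, 0.725, 0.9]

# Shapes for the feature map with scale 0.55 (conv9_2), whose next scale is 0.725.
shapes = box_shapes(scales[2], scales[3], [1, 2, 3, 1 / 2, 1 / 3])
print(len(shapes))  # 6
```

Five aspect ratios plus the extra square box give the k=6 shapes used by the middle feature maps; dropping 3 and 1/3 gives the k=4 case.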
3. Predictions
Each feature map uses its own k, which is fixed by the number of aspect ratios chosen for it.
SSD produces outputs from the 6 feature maps using Conv(3x3x(k x (c + offsets))). For an m x n feature map, each output has (c + offsets) x k x m x n values.
Example: take conv9_2. Its feature map is 5x5 and it uses 6 kinds of default boxes, so k=6, $s_{4}$=0.55, $s'_{4}=\sqrt{0.55 \times 0.725} \approx 0.631$, and c=21 (20 classes, 1 background). The layer therefore has 5x5x6=150 default boxes, and its output size is 5x5x6x(21+4)=3750.
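The arithmetic for a prediction head can be checked with a small helper that keeps the number of default boxes separate from the number of predicted values. The function name and the `num_offsets=4` default are illustrative assumptions.

```python
def head_output_size(m, n, k, num_classes, num_offsets=4):
    """Return (number of default boxes, number of predicted values) for an m x n map."""
    boxes = m * n * k
    values = boxes * (num_classes + num_offsets)
    return boxes, values


# conv9_2: 5x5 feature map, k = 6, c = 21 (20 classes + background)
print(head_output_size(5, 5, 6, 21))  # (150, 3750)
```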
4. Matching strategy
During training we need to determine which default boxes correspond to a ground truth detection. We begin by matching each ground truth box to the default box with the best jaccard overlap (= IoU). We then match default boxes to any ground truth whose jaccard overlap is higher than a threshold (0.5). These matched boxes are labeled positive. Default boxes whose jaccard overlap is lower than the threshold are labeled negative.
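The matching rule can be sketched in plain Python as two passes. This is a simplified illustration under assumptions: boxes are `(xmin, ymin, xmax, ymax)` tuples, the names `iou` and `match` are hypothetical, and ties are broken by index order.

```python
def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """For each default box, return the index of its matched truth, or -1 (negative)."""
    assigned = [-1] * len(defaults)
    if not defaults or not truths:
        return assigned
    # Pass 1: a default box is positive if its best overlap exceeds the threshold.
    for i, d in enumerate(defaults):
        overlaps = [iou(d, t) for t in truths]
        j = max(range(len(truths)), key=overlaps.__getitem__)
        if overlaps[j] > threshold:
            assigned[i] = j
    # Pass 2: every ground truth claims its best default box, even below the threshold.
    for j, t in enumerate(truths):
        best = max(range(len(defaults)), key=lambda i: iou(defaults[i], t))
        assigned[best] = j
    return assigned


defaults = [(0.0, 0.0, 0.5, 0.5), (0.1, 0.1, 0.6, 0.6),
            (0.5, 0.5, 1.0, 1.0), (0.6, 0.6, 0.8, 0.8)]
truths = [(0.0, 0.0, 0.5, 0.5), (0.55, 0.55, 0.85, 0.85)]
print(match(defaults, truths))  # [0, -1, -1, 1]
```

In the example, the second ground truth has no default box above 0.5, so the best-overlap pass is what guarantees it still gets a positive box.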
After the matching step, most of the default boxes are negatives. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them by the confidence loss of each default box, from highest to lowest, and pick the top ones so that the ratio between negatives and positives is at most 3:1. This leads to faster optimization and more stable training.
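Hard negative mining as described above can be sketched as a sort followed by a cut-off. The function name and the list-based inputs are assumptions for illustration; a real implementation would operate on batched tensors.

```python
def hard_negative_indices(conf_losses, is_positive, neg_pos_ratio=3):
    """Pick the negatives with the highest confidence loss, at most ratio x #positives."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: neg_pos_ratio * num_pos]


losses = [0.2, 2.5, 0.1, 1.7, 0.9, 0.05, 3.0, 0.4]
positive = [True, False, False, False, False, False, False, False]
print(hard_negative_indices(losses, positive))  # [6, 1, 3]
```

With one positive box, only the three negatives with the largest loss contribute to the confidence loss; the other four are ignored.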
5. Loss function
Symbol | Meaning | Value |
---|---|---|
$x^{p}_{ij}$ | indicator for matching the i-th default box to the j-th ground truth box of category p | {1, 0} |
N | number of matched default boxes | |
α | weight balancing the two losses | 1 |
l | predicted box | |
g | ground truth box | |
d | default bounding box | |
cx, cy | center of the box | |
w, h | width, height | |
$L_{loc}$ | $\mathrm{smooth}_{L1}$ loss | |
$L_{conf}$ | softmax loss | |
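For reference, the objective these symbols belong to can be written out as below, following the formulation in the SSD paper. The hat notation ($\hat{g}$ for the encoded ground truth offsets, $\hat{c}$ for the softmax-normalized class scores) and the sets Pos and Neg (positive and selected negative default boxes) are not defined in the table and are introduced here.

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x^{p}_{ij}\, \mathrm{smooth}_{L1}\left(l^{m}_{i} - \hat{g}^{m}_{j}\right)$$

$$\hat{g}^{cx}_{j} = \frac{g^{cx}_{j} - d^{cx}_{i}}{d^{w}_{i}}, \quad \hat{g}^{cy}_{j} = \frac{g^{cy}_{j} - d^{cy}_{i}}{d^{h}_{i}}, \quad \hat{g}^{w}_{j} = \log\frac{g^{w}_{j}}{d^{w}_{i}}, \quad \hat{g}^{h}_{j} = \log\frac{g^{h}_{j}}{d^{h}_{i}}$$

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x^{p}_{ij} \log \hat{c}^{p}_{i} - \sum_{i \in Neg} \log \hat{c}^{0}_{i}, \quad \hat{c}^{p}_{i} = \frac{\exp(c^{p}_{i})}{\sum_{p} \exp(c^{p}_{i})}$$

If N = 0, the loss is set to 0. The localization term is computed only over positive boxes, while the confidence term also includes the negatives, which are scored against the background class (index 0).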
6. Data augmentation
To make the model more robust to various input object sizes and shapes, each training image is randomly sampled using the options below.
- Use the entire original input image
- Sample a random patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9
- The aspect ratio of each sampled patch is between 1/2 and 2
- Flip horizontally with probability 0.5
- Apply some photometric distortions
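The sampling options above could be sketched as follows. This is a simplified illustration under several assumptions: coordinates are normalized to [0, 1], the patch side is drawn from [0.1, 1] of the image side (the range given in the paper), a patch is accepted when at least one object overlaps it by the chosen threshold, and photometric distortions are omitted. All names are hypothetical.

```python
import random

MIN_OVERLAPS = [0.1, 0.3, 0.5, 0.7, 0.9]
FULL_IMAGE = (0.0, 0.0, 1.0, 1.0)


def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(rng, objects, min_overlap, max_tries=50):
    """Try to sample a patch with aspect ratio in [1/2, 2] that overlaps an object."""
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if not 0.5 <= w / h <= 2.0:
            continue
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if any(iou(patch, obj) >= min_overlap for obj in objects):
            return patch
    return FULL_IMAGE  # fall back to the original image if no patch qualifies


def choose_augmentation(rng, objects):
    """Return (mode, patch, flip) for one training image."""
    mode = rng.choice(["original"] + MIN_OVERLAPS)
    patch = FULL_IMAGE if mode == "original" else sample_patch(rng, objects, mode)
    flip = rng.random() < 0.5  # horizontal flip with probability 0.5
    return mode, patch, flip


rng = random.Random(0)
mode, patch, flip = choose_augmentation(rng, [(0.2, 0.2, 0.8, 0.8)])
print(mode, patch, flip)
```

The returned patch would then be cropped from the image, resized to the network's fixed input size, and flipped if `flip` is true.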
Experiment
The experiments show two important things:
1. Data augmentation helps to improve performance
2. SSD has difficulty detecting small objects
Reference
[IMG]: https://arxiv.org/pdf/1512.02325.pdf
[Study]: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection