FCOS: Fully Convolutional One-Stage Object Detection
Motivation
Almost all previous SOTA object detectors use predefined anchors. Anchors require complicated computation, such as calculating overlap with ground-truth boxes during training, and introduce many hyper-parameters. By eliminating these issues, FCOS can reach SOTA performance without anchors.
Main Idea
Weaknesses of anchor-based detectors
The authors point out four weaknesses of anchor-based detectors:
1. Performance is sensitive to the sizes, aspect ratios, and number of anchor boxes.
2. Even with carefully fixed scales and aspect ratios, detectors struggle with large shape variations, especially for small objects.
3. Anchor-based detectors need to place anchor boxes densely to achieve high recall, which greatly increases computation.
4. They also require additional complicated computation, such as IoU scores between anchors and ground-truth bounding boxes during training.
By eliminating anchors, all four weaknesses are removed.
Advantages of anchor-free detectors
1. Detection is now unified with many other FCN-solvable tasks, making it easier to re-use ideas from those tasks.
2. Detection becomes proposal-free and anchor-free, which reduces the number of design parameters that typically need heuristic tuning and many tricks.
3. By eliminating anchor boxes, complicated computation such as IoU matching is removed.
4. The detector can replace the RPN in two-stage detectors and achieves better performance.
5. It can be immediately extended to other vision tasks, such as instance segmentation and keypoint detection, with minimal modification.
Anchor-free detector
Anchor-free detectors take advantage of all points inside a ground-truth bounding box to predict boxes, but existing ones either produce low-quality detected bounding boxes or require much more complicated post-processing (e.g., YOLOv1, CornerNet). The authors propose a center-ness branch to address this. As a result, FCOS provides recall comparable to anchor-based detectors.
Fully Convolutional One-Stage Object Detector
| Meaning | Notation |
|---|---|
| Feature maps at layer $i$ | $F_{i} \in R^{H \times W \times C}$ |
| Ground-truth bounding boxes | $B_{i} = (x_{0}^{i}, y_{0}^{i}, x_{1}^{i}, y_{1}^{i}, c^{i}) \in R^{4} \times \{1, 2, \dots, C\}$ |
| Left-top and right-bottom corners | $(x_{0}^{i}, y_{0}^{i})$, $(x_{1}^{i}, y_{1}^{i})$ |
| Class of the object | $c^{i}$ |
| Number of classes | $C$ |
| Stride of the layer | $s$ |
For each location $(x, y)$ on the feature map $F_{i}$, we can map it back onto the input image as $(\lfloor \frac{s}{2} \rfloor + xs, \lfloor \frac{s}{2} \rfloor + ys)$, which is near the center of the receptive field of the location $(x, y)$.
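For illustration, this mapping can be written as a small helper (a sketch of my own; the function name and the use of PyTorch are assumptions, not from the paper):

```python
import torch

def feature_map_locations(height, width, stride):
    """Map each (x, y) on an H x W feature map back to input-image
    coordinates (floor(s/2) + x*s, floor(s/2) + y*s)."""
    xs = torch.arange(width) * stride + stride // 2   # floor(s/2) + x*s
    ys = torch.arange(height) * stride + stride // 2  # floor(s/2) + y*s
    y_grid, x_grid = torch.meshgrid(ys, xs, indexing="ij")
    # (H*W, 2) tensor of (x, y) image coordinates, one row per location
    return torch.stack((x_grid.reshape(-1), y_grid.reshape(-1)), dim=1)
```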
A location $(x, y)$ is considered a positive sample if it falls into any ground-truth box, and the class label $c^{*}$ of the location is set to the class label of that box; otherwise it is a negative sample and $c^{*} = 0$ (background). Besides the classification label, FCOS also regresses a 4D real vector $t^{*} = (l^{*}, t^{*}, r^{*}, b^{*})$ at each positive location. If a location falls into multiple bounding boxes, it is considered an ambiguous sample, and the bounding box with the minimal area is simply chosen as its regression target.
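Concretely, for a positive location $(x, y)$ associated with a box $B_{i}$, the regression targets are the distances from the location to the four sides of the box:

$$l^{*} = x - x_{0}^{i}, \quad t^{*} = y - y_{0}^{i}, \quad r^{*} = x_{1}^{i} - x, \quad b^{*} = y_{1}^{i} - y$$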
It is worth noting that FCOS can leverage as many foreground samples as possible to train the regressor.
Network Outputs
At each location, FCOS predicts a $C$-dimensional (80 for COCO) vector $p$ of classification scores and a 4D vector $t = (l, t, r, b)$ of bounding-box distances. FCOS adds four convolutional layers after the backbone's feature maps for the classification branch and the regression branch, respectively. Moreover, since the regression targets are always positive, $exp(x)$ is applied on top of the regression branch to map any real number to $(0, \infty)$.
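A minimal sketch of such a head (my own simplified PyTorch code, not the official implementation; the 256 input channels and the use of GroupNorm are assumptions, and it already includes the center-ness branch and per-level scales $s_{i}$ described later in this post):

```python
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Simplified FCOS head: two towers of four 3x3 convs shared across
    FPN levels, plus classification, regression and center-ness outputs."""
    def __init__(self, in_channels=256, num_classes=80, num_levels=5):
        super().__init__()
        def tower():
            layers = []
            for _ in range(4):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)
        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)
        self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1)
        self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)
        # one trainable scalar s_i per feature level for exp(s_i * x)
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, features):
        outputs = []
        for i, x in enumerate(features):           # one FPN level at a time
            cls_feat = self.cls_tower(x)
            reg_feat = self.reg_tower(x)
            logits = self.cls_logits(cls_feat)     # (N, C, H, W) class scores
            ctr = self.centerness(cls_feat)        # (N, 1, H, W) center-ness
            # exp keeps the (l, t, r, b) predictions in (0, inf)
            reg = torch.exp(self.scales[i] * self.bbox_pred(reg_feat))
            outputs.append((logits, reg, ctr))
        return outputs
```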
Loss Function
FCOS is trained with the following loss function:
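$$L(\{p_{x,y}\}, \{t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(p_{x,y}, c_{x,y}^{*}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c_{x,y}^{*} > 0\}} L_{reg}(t_{x,y}, t_{x,y}^{*})$$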
$L_{cls}$ is the focal loss and $L_{reg}$ is the IoU loss. $N_{pos}$ denotes the number of positive samples, and $\lambda$, set to 1 in this paper, is the balance weight for $L_{reg}$. The indicator $\mathbb{1}_{\{c_{x,y}^{*} > 0\}}$ is 1 if $c_{x,y}^{*} > 0$ and 0 otherwise.
Inference
At inference, FCOS obtains the classification scores $p_{x,y}$ and the regression predictions $t_{x,y}$ for each location on the feature maps $F_{i}$, and the locations with $p_{x,y} > 0.05$ are chosen as positive samples.
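The predicted distances are then converted back into a box by inverting the regression-target definition: for a location $(x, y)$ with prediction $(l, t, r, b)$,

$$x_{0} = x - l, \quad y_{0} = y - t, \quad x_{1} = x + r, \quad y_{1} = y + b$$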
Multi-level Prediction with FPN for FCOS
Here the authors show how two possible issues of FCOS can be resolved with multi-level prediction using FPN.
1) The large stride of the final feature maps in a CNN can result in a low best possible recall (BPR).
In fact, FCOS already produces a good BPR even with a large stride, and with multi-level FPN prediction it achieves a better BPR than anchor-based detectors.
2) Overlaps in ground truth boxes can cause intractable ambiguity.
This ambiguity results in degraded performance of FCN based detectors, but it can be resolved with multi-level prediction.
FCOS makes use of five levels of feature maps, defined as $\{P_{3}, P_{4}, P_{5}, P_{6}, P_{7}\}$. $P_{3}, P_{4}, P_{5}$ are produced from the backbone CNN's feature maps $C_{3}, C_{4}, C_{5}$ by a 1 x 1 convolutional layer with top-down connections. $P_{6}, P_{7}$ are produced by applying one convolutional layer with stride 2 on $P_{5}$ and $P_{6}$, respectively. As a result, the feature levels have strides 8, 16, 32, 64, 128, respectively.
FCOS computes the regression targets $l^{*}, t^{*}, r^{*}, b^{*}$ for each location on all feature levels. If a location satisfies $max(l^{*}, t^{*}, r^{*}, b^{*}) > m_{i}$ or $max(l^{*}, t^{*}, r^{*}, b^{*}) < m_{i-1}$, it is set as a negative sample and is thus no longer required to regress a bounding box. Here $m_{i}$ is the maximum distance that feature level $i$ needs to regress; $m_{2}, m_{3}, m_{4}, m_{5}, m_{6}, m_{7}$ are set to 0, 64, 128, 256, 512, and $\infty$, respectively. If a location, even with multi-level prediction, is still assigned to more than one ground-truth box, FCOS simply chooses the ground-truth box with the minimal area as its target.
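A rough sketch of this assignment rule (my own code; the function and constant names are made up):

```python
import math
import torch

# (m_{i-1}, m_i] regression ranges for levels P3..P7, from the paper
LEVEL_BOUNDS = [(0, 64), (64, 128), (128, 256), (256, 512), (512, math.inf)]

def assign_level(reg_targets):
    """reg_targets: (N, 4) tensor of (l*, t*, r*, b*) for N locations.
    Returns an (N,) tensor with the FPN level index (0..4 for P3..P7),
    or -1 if the location is a negative sample at every level."""
    max_dist = reg_targets.max(dim=1).values
    level = torch.full_like(max_dist, -1, dtype=torch.long)
    for i, (lo, hi) in enumerate(LEVEL_BOUNDS):
        mask = (max_dist > lo) & (max_dist <= hi)
        level[mask] = i
    return level
```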
Finally, FCOS shares the head between different feature levels. However, different feature levels are required to regress different size ranges, so it is not entirely reasonable to use identical heads for all levels. Therefore, instead of $exp(x)$, FCOS uses $exp(s_{i}x)$ with a trainable scalar $s_{i}$ for each level.
Center-ness for FCOS
After using multi-level prediction, there is still a performance gap between FCOS and anchor-based detectors, caused by low-quality boxes predicted at locations far from the center of an object. The authors therefore propose a simple yet effective strategy to suppress these low-quality detected bounding boxes without introducing any hyper-parameters: a single-layer branch, in parallel with the classification branch, that predicts the "center-ness" of each location. The center-ness is the normalized distance from the location to the center of the object it is responsible for. Given the regression targets $l^{*}, t^{*}, r^{*}, b^{*}$ of a location, the center-ness target is defined as:
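$$centerness^{*} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$$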
The square root slows down the decay of the center-ness. The center-ness ranges from 0 to 1 and is therefore trained with the BCE loss. At test time, the final score is computed by multiplying the predicted center-ness with the corresponding classification score, so the center-ness down-weights the scores of bounding boxes far from the center of an object, and these boxes are then filtered out by NMS. An alternative to the center-ness is to use only the central portion of the ground-truth bounding box as positive samples, at the cost of one extra hyper-parameter; combining both approaches has been reported to achieve even better performance.
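As a small sketch of how the center-ness target and the final ranking score could be computed (my own code, not the official implementation):

```python
import torch

def centerness_target(reg_targets):
    """reg_targets: (N, 4) tensor of (l*, t*, r*, b*) for positive locations."""
    l, t, r, b = reg_targets.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r)
    tb = torch.min(t, b) / torch.max(t, b)
    return torch.sqrt(lr * tb)   # in [0, 1], supervised with a BCE loss

# At inference, center-ness down-weights off-center boxes before NMS:
# final_score = classification_score * predicted_centerness
```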
Reference
FCOS: Fully Convolutional One-Stage Object Detection (ICCV 2019): https://arxiv.org/pdf/1904.01355.pdf