Motivation
YOLO_v1 makes a significant number of localization errors and has relatively low recall compared to region-proposal-based methods like RPN.
Thus YOLO_v2 focuses mainly on improving recall and localization.
Main Idea
The paper introduces YOLO_v2 and YOLO9000.
Better
YOLO_v2 aims for a more accurate detector that is still fast, without simply scaling up the network.
1. Batch Normalization
Batch normalization normalizes the features of each batch using the batch mean and variance before the activation, so that layer inputs keep a consistent distribution even when batches vary.
· $BN(X) = \gamma(\frac{X-\mu_{batch}}{\sigma_{batch}})+\beta$
· $\mu_{batch}=\frac{1}{B}\sum_{i}x_{i}$
· $\sigma_{batch}^{2}=\frac{1}{B}\sum_{i}(x_{i}-\mu_{batch})^2$
It works well when the batch size is neither too small nor too large, because the statistics are computed over each batch.
At inference time, the model uses the mean and variance accumulated during training.
Batch normalization leads to significant improvements in convergence while eliminating the need for other forms of regularization.
This gives YOLO_v2 a more than 2% improvement in mAP.
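As a concrete illustration, here is a minimal NumPy sketch of the batch-normalization formula above, using training-time statistics only (`gamma`, `beta`, and the small `eps` for numerical stability follow the standard BN definition; at inference the running mean and variance would be used instead, as noted above):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch of features with its own mean/variance.

    x: (B, D) float array, one row per sample in the mini-batch.
    gamma, beta: learned scale and shift, each of shape (D,).
    """
    mu = x.mean(axis=0)                  # mu_batch
    var = x.var(axis=0)                  # sigma_batch^2
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 64)              # batch of 32 samples, 64 features
y = batch_norm(x, gamma=np.ones(64), beta=np.zeros(64))
```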
2. High Resolution Classifier
YOLO_v1 trains the classifier network at 224x224 and increases the resolution to 448 for detection. This means the network has to learn object detection and adjust to the new input resolution at the same time.
YOLO_v2 instead first fine-tunes the classification network at 448x448 resolution for 10 epochs on ImageNet, and then fine-tunes the network on detection. This gives YOLO_v2 an almost 4% improvement in mAP.
3. Convolutional With Anchor Boxes
YOLO_v1 predicts the coordinates of bounding boxes directly using FC layers on top of the conv feature extractor.
YOLO_v2 adopts the offset parametrization from Faster R-CNN instead of predicting coordinates directly; predicting offsets simplifies the problem and makes it easier for the network to learn. Accordingly, YOLO_v2 removes the FC layers and uses anchor boxes.
YOLO_v2 wants an odd number of locations in the feature map so that a large object falls into a single center cell rather than four adjacent ones. It therefore shrinks the input to 416x416 instead of 448x448; the conv layers downsample the image by a factor of 32, giving a 13x13 feature map.
Using anchor boxes gives a small decrease in accuracy: without anchor boxes the model gets 69.5 mAP with 81% recall; with anchor boxes it gets 69.2 mAP with 88% recall. Even though the mAP decreases slightly, the increase in recall means the model has more room to improve.
4. Dimension Clusters
Anchor boxes have two issues. The first is that the box dimensions are hand-picked. Instead of choosing priors by hand, YOLO_v2 runs k-means clustering on the training-set bounding boxes to automatically find good priors, using the following distance metric:
$d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$
YOLO_v2 runs k-means for various values of k and plots the average IOU with the closest centroid.
YOLO_v2 also compares the average IOU to the closest prior for the clustering strategy versus the hand-picked anchor boxes.
k=5 gives a good trade-off between model complexity and high recall.
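A rough sketch of this dimension-clustering step, assuming boxes are given as (w, h) pairs; the IOU here treats both boxes as sharing the same center, which is how the distance metric above behaves (function names are mine, not from the paper's code):

```python
import numpy as np

def iou_wh(box, centroids):
    """IOU between one (w, h) box and k (w, h) centroids, ignoring position."""
    inter = np.minimum(box[0], centroids[:, 0]) * np.minimum(box[1], centroids[:, 1])
    union = box[0] * box[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    """k-means on (w, h) pairs with d = 1 - IOU as the distance.

    boxes: float array of shape (N, 2).
    """
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IOU
        assign = np.array([np.argmin(1 - iou_wh(b, centroids)) for b in boxes])
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = boxes[assign == c].mean(axis=0)
    return centroids

boxes = np.random.rand(1000, 2)           # toy (w, h) pairs in [0, 1]
anchors = kmeans_anchors(boxes, k=5)
```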
5. Direct location prediction
Anchor boxes have a second issue: model instability. It comes from predicting the (x, y) locations for the box.
$x = (t_{x} \times w_{a}) + x_{a}$
$y = (t_{y} \times h_{a}) + y_{a}$
This formulation is unconstrained, so any anchor box can end up at any point in the image. YOLO_v2 instead predicts 5 bounding boxes at each cell in the output feature map, with 5 values per box: $t_{x}$, $t_{y}$, $t_{w}$, $t_{h}$, $t_{o}$.
$b_{x} = \sigma(t_{x}) + c_{x}$
$b_{y} = \sigma(t_{y}) + c_{y}$
$b_{w} = p_{w}e^{t_{w}}$
$b_{h} = p_{h}e^{t_{h}}$
$Pr(\text{object}) \times IOU(b, \text{object}) = \sigma(t_{o})$

| Symbol | Meaning |
|---|---|
| $b_{x}$, $b_{y}$, $b_{w}$, $b_{h}$ | center (x, y) and (w, h) of the bounding box |
| $c_{x}$, $c_{y}$ | top-left (x, y) offset of the cell |
| $p_{w}$, $p_{h}$ | (w, h) of the anchor box |
| $\sigma$ | logistic activation function (sigmoid) |
Because YOLO_v2 constrains the location prediction, the parametrization is easier to learn, making the network more stable.
Using dimension clusters together with directly predicting the bounding-box center location improves YOLO_v2 by almost 5% over the version with anchor boxes.
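A minimal sketch of the decoding step implied by the equations and table above (function and variable names are mine, not from the paper's code):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Turn raw network outputs into a box, per the YOLO_v2 parametrization.

    (cx, cy): top-left offset of the cell; (pw, ph): anchor prior size.
    """
    bx = sigmoid(tx) + cx          # center constrained to fall inside the cell
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)           # scales the prior, always positive
    bh = ph * np.exp(th)
    confidence = sigmoid(to)       # Pr(object) * IOU(b, object)
    return bx, by, bw, bh, confidence
```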
6. Fine-grained Features
YOLO_v2 predicts detections on a 13x13 feature map, so it may not detect small objects well. YOLO_v2 therefore simply adds a passthrough layer that brings in features from an earlier layer at 26x26 resolution.
The passthrough layer turns the 26x26x512 feature map into a 13x13x2048 feature map, which can be concatenated with the original features. This gives a modest 1% performance increase.
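The passthrough layer is essentially a space-to-depth rearrangement. A minimal NumPy sketch, assuming channels-last (H, W, C) layout:

```python
import numpy as np

def passthrough(x, stride=2):
    """Stack adjacent spatial positions into channels: (H, W, C) -> (H/2, W/2, 4C)."""
    h, w, c = x.shape
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)

fine = np.random.randn(26, 26, 512)      # earlier high-resolution features
coarse = np.random.randn(13, 13, 1024)   # final backbone features
merged = np.concatenate([passthrough(fine), coarse], axis=-1)  # 13x13x3072
```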
7. Multi-Scale Training
YOLO_v2 uses an input resolution of 416x416 but wants the network to be robust to images of various sizes. Instead of fixing the input image size, the network is changed every few iterations: every 10 batches the network randomly chooses a new image dimension, pulled from the following multiples of 32: {320, 352, ..., 608}. The smallest option is 320x320 and the largest is 608x608.
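A sketch of that schedule; the resizing and training step themselves are elided, so this only shows how the input size would be re-drawn every 10 batches:

```python
import random

SCALES = list(range(320, 609, 32))   # {320, 352, ..., 608}

def multi_scale_sizes(num_batches, period=10, seed=0):
    """Yield an input size for each batch, re-drawn every `period` batches."""
    rng = random.Random(seed)
    size = 416
    for i in range(num_batches):
        if i % period == 0:
            size = rng.choice(SCALES)  # new resolution every 10 batches
        yield size

for step, size in enumerate(multi_scale_sizes(30)):
    # images would be resized to (size, size) before the forward pass
    pass
```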
Faster
YOLO_v2 wants to be both accurate and fast. Most detection frameworks are based on VGG-16, but VGG-16 is needlessly complex, so YOLO_v2 uses a custom network based on GoogLeNet. This network is faster than VGG-16, though its accuracy is slightly worse (88.0% vs. 90.0% top-5 accuracy on ImageNet).
1. Darknet-19
Darknet-19 uses mostly 3x3 filters and doubles the number of channels after every pooling step. Following Network in Network (NIN), it uses global average pooling to make predictions, as well as 1x1 filters to compress the feature representation between the 3x3 convolutions. Darknet-19 has 19 conv layers and 5 max-pooling layers; a compact summary follows.
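The configuration below is transcribed from the architecture table in the paper (the list encoding is mine):

```python
# (filters, kernel_size) for conv layers; "M" = 2x2 max-pool with stride 2.
DARKNET19 = [
    (32, 3), "M",
    (64, 3), "M",
    (128, 3), (64, 1), (128, 3), "M",
    (256, 3), (128, 1), (256, 3), "M",
    (512, 3), (256, 1), (512, 3), (256, 1), (512, 3), "M",
    (1024, 3), (512, 1), (1024, 3), (512, 1), (1024, 3),
    (1000, 1),  # classifier conv, followed by global average pooling + softmax
]
assert sum(isinstance(layer, tuple) for layer in DARKNET19) == 19  # 19 conv layers
assert DARKNET19.count("M") == 5                                   # 5 max-pool layers
```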
2. Training for classification
Darknet-19 is trained on the standard ImageNet 1000-class classification task: first on 224x224 images for 160 epochs using SGD with a starting learning rate of 0.1, polynomial rate decay with a power of 4, weight decay of 5e-4, and momentum of 0.9. The network is then fine-tuned at 448x448 for only 10 epochs with a 1e-3 learning rate.
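The polynomial decay with power 4 works out to the following, assuming the decay runs over the full 160 epochs (the paper does not spell out the schedule horizon):

```python
def poly_lr(epoch, base_lr=0.1, max_epochs=160, power=4):
    """Polynomial learning-rate decay: lr shrinks to 0 by max_epochs."""
    return base_lr * (1 - epoch / max_epochs) ** power

print(poly_lr(0))    # 0.1
print(poly_lr(80))   # 0.1 * 0.5**4 = 0.00625
```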
3. Training for detection
YOLO_v2 modifies Darknet-19 for detection after training for classification: it removes the last conv layer and instead adds three 3x3 convolutional layers with 1024 filters each, followed by a 1x1 conv layer with the number of outputs needed for detection. YOLO_v2 predicts 5 boxes with 5 coordinates each and c classes per box, so 5(5+c) filters.
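As a quick check of that channel count (the function name is illustrative):

```python
def detection_filters(num_anchors=5, num_classes=20):
    """Each anchor predicts tx, ty, tw, th, to plus one score per class."""
    return num_anchors * (5 + num_classes)

print(detection_filters())                 # 125 filters for VOC (20 classes)
print(detection_filters(num_classes=80))   # 425 filters for COCO (80 classes)
```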
YOLO_v2 trains the detection network for 160 epochs with a 1e-3 learning rate, dividing it by 10 at 60 and 90 epochs, with weight decay of 5e-4 and momentum of 0.9. It uses similar data augmentation to YOLO and SSD, with random crops, color shifting, etc.
Stronger
YOLO_v2 proposes a mechanism for jointly training on classification and detection data. When the network sees an image labelled for detection, it backpropagates the full YOLO_v2 loss function. When it sees a classification image, it backpropagates only the loss from the classification-specific parts of the architecture.
This approach presents a few challenges. Detection datasets have only common objects and general labels, like "dog", while classification datasets have a much wider and deeper range of labels, like "Norfolk terrier" and "Yorkshire terrier". This causes a problem for softmax: using a softmax assumes the classes are mutually exclusive, but "dog" in COCO and "Norfolk terrier" in ImageNet are not mutually exclusive.
To solve this problem YOLO_v2 uses a multi-label model.
1) Hierarchical classification
Image labels are pulled from WordNet, which is structured as a directed graph. YOLO_v2 builds a hierarchical tree, WordTree, from the concepts in ImageNet. To build this tree, YOLO_v2 examines the visual nouns and looks at their paths through the WordNet graph to the root node. Many synsets have only one path through the graph, so all of those paths are added to WordTree first. Then YOLO_v2 iteratively examines the remaining concepts and adds the paths that grow the tree by as little as possible: if a concept has two paths to the root, the shorter path is chosen.
To perform classification with WordTree, YOLO_v2 predicts conditional probabilities at every node for the probability of each hyponym, e.g.:
Pr(Norfolk terrier | terrier)
Pr(Yorkshire terrier | terrier)
To compute the absolute probability for a particular node, simply follow the path through the tree to the root node and multiply the conditional probabilities:
Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier)
 * Pr(terrier | hunting dog)
 * ...
 * Pr(animal | physical object)

where Pr(physical object) is the objectness probability.
During training YOLO_v2 propagates ground-truth labels up the tree: if an image is labelled as a "Norfolk terrier", it also gets labelled as a "dog", a "mammal", etc.
To compute the conditional probabilities, the model predicts a vector of 1369 values and computes a softmax over all synsets that are hyponyms of the same concept.
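A minimal sketch of how the WordTree scoring above could work, with a toy tree; the node names, structure, and probability values here are illustrative, not the real 1369-node tree:

```python
# toy WordTree: child -> parent (None marks the root "physical object")
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
}

def absolute_prob(node, cond_prob, objectness=1.0):
    """Multiply conditional probabilities along the path up to the root."""
    p = objectness  # Pr(physical object); assumed 1.0 for pure classification
    while PARENT[node] is not None:
        p *= cond_prob[node]        # Pr(node | parent)
        node = PARENT[node]
    return p

# assumed conditional probabilities (softmax over siblings at each node)
cond = {"animal": 0.9, "dog": 0.8, "terrier": 0.7,
        "Norfolk terrier": 0.6, "Yorkshire terrier": 0.4}
print(absolute_prob("Norfolk terrier", cond))  # 0.9*0.8*0.7*0.6 = 0.3024
```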
2) Dataset combination with WordTree
YOLO_v2 can use WordTree to combine multiple datasets together, e.g., COCO and ImageNet.
3) Joint classification and detection
YOLO_v2 combines the ImageNet classification data and the COCO detection set using WordTree; the combined dataset has 9418 classes. ImageNet is a much larger dataset, so COCO is oversampled so that the ratio is only 4:1 in training. On this dataset YOLO_v2 trains YOLO9000, using only 3 priors instead of 5 to limit the output size. (YOLO9000 can detect more than 9000 object categories.)
When the network sees a detection image, YOLO9000 backpropagates loss as normal. For the classification loss, it only backpropagates loss at or above the corresponding level of the label: if the label is "dog", it considers "animal" but not "terrier". When it sees a classification image, YOLO9000 backpropagates only classification loss: it simply finds the bounding box that predicts the highest probability for that class and computes the loss on just its predicted tree. YOLO9000 also backpropagates objectness loss, assuming the predicted box overlaps what a ground-truth box would be by at least 0.3 IOU.
Reference
[IMG-1,2,4~7]: https://arxiv.org/pdf/1612.08242.pdf
[IMG-3]: https://velog.io/@skhim520/YOLO-v2-논문-리뷰
[Implementation]: https://github.com/kongbuhaja/YOLO_v2