2024. 4. 4. 00:29
Motivation
Recent SOTA models incur high computation and memory costs at inference. One promising direction for mitigating this burden is to transfer knowledge from a large teacher model to a smaller student model. Previous approaches can be expressed as training the student to mimic the teacher's output activations for individual data examples. The authors instead introduce relational knowledge distillation (RKD), which transfers the mutual relations of data examples. They propose distance-wise and angle-wise distillation losses that penalize structural differences in those relations.
Main Idea
The central tenet of the work is that what constitutes knowledge is better represented by the relations among learned representations than by the individual representations themselves. The primary information therefore lies in the structure of the data embedding space.

Notation. Given a teacher model T and a student model S, let $f_T$ and $f_S$ be the functions of the teacher and the student, respectively. The function $f$ can be defined using the output of any layer of the network. The authors denote by $X^N$ the set of N-tuples of distinct data examples.
Conventional knowledge distillation
Conventional KD methods can commonly be expressed as minimizing an objective function that compares teacher and student outputs one example at a time.
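In the paper's notation, with $l$ a loss that penalizes the difference between the teacher and the student and $X$ the set of training examples, this individual KD (IKD) objective can be written as:

$$
\mathcal{L}_{\text{IKD}} = \sum_{x_i \in X} l\big(f_T(x_i),\ f_S(x_i)\big)
$$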

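For example, $l$ can be the KL divergence between the temperature-softened softmax outputs of the teacher and the student (Hinton et al.), or the squared Euclidean distance between hidden-layer activations (FitNets, Romero et al.). If the teacher and student outputs have different dimensions, a linear mapping $\beta$ can be used to bridge them.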
Relational knowledge distillation
RKD aims at transferring structural knowledge using mutual relations of data examples in the teacher's output representation. Unlike conventional approaches, it computes a relational potential $\psi$ for each n-tuple of data examples and transfers information through the potential from the teacher to the student.
Let us define $t_i = f_T(x_i)$ and $s_i = f_S(x_i)$.
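With these embeddings, the RKD objective in the paper takes the following form, where $l$ penalizes the difference between the teacher's and the student's potentials:

$$
\mathcal{L}_{\text{RKD}} = \sum_{(x_1, \ldots, x_n) \in X^N} l\big(\psi(t_1, \ldots, t_n),\ \psi(s_1, \ldots, s_n)\big)
$$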

($x_1$, $x_2$, ..., $x_n$) is an n-tuple drawn from $X^N$, and $\psi$ is a relational potential function that measures a relational energy of the given n-tuple. RKD trains the student model to form the same relational structure as the teacher. Because only the potentials are compared, it can transfer knowledge of high-order properties regardless of any difference in output dimensions between the teacher and the student. A higher-order potential may be more powerful in capturing higher-level structure, but it is also more expensive to compute. In this paper the authors propose two potentials: a distance-wise one and an angle-wise one.
Distance-wise distillation loss
Given a pair of examples, the distance-wise potential function $\psi_D$ measures the Euclidean distance between the two examples in the output representation space, scaled by a normalization factor $\mu$. To focus on relative distances among pairs, the authors set $\mu$ to the average distance between pairs in the mini-batch.
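Written out for a pair of teacher embeddings $(t_i, t_j)$, the potential and its normalization factor are as follows. The student side is analogous, with $s_i$, $s_j$ in place of $t_i$, $t_j$, and $X^2$ here denotes the pairs of distinct examples in the mini-batch:

$$
\psi_D(t_i, t_j) = \frac{1}{\mu}\,\lVert t_i - t_j \rVert_2,
\qquad
\mu = \frac{1}{|X^2|} \sum_{(x_i, x_j) \in X^2} \lVert t_i - t_j \rVert_2
$$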

This mini-batch distance normalization is particularly useful when there is a significant difference in scale between teacher distances and student distances. It leads to more stable and faster convergence in training.
Using the distance-wise potentials measured in both the teacher and the student, a distance-wise distillation loss is defined, where $l_{\delta}$ is the Huber loss.
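In the paper's notation, this loss and the Huber loss it relies on read as follows:

$$
\mathcal{L}_{\text{RKD-D}} = \sum_{(x_i, x_j) \in X^2} l_{\delta}\big(\psi_D(t_i, t_j),\ \psi_D(s_i, s_j)\big),
\qquad
l_{\delta}(x, y) =
\begin{cases}
\frac{1}{2}(x - y)^2 & \text{if } |x - y| \le 1 \\
|x - y| - \frac{1}{2} & \text{otherwise}
\end{cases}
$$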

The distance-wise distillation loss transfers the relationship of examples by penalizing distance differences between their output representation spaces.
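To make the distance-wise loss concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, embeddings are plain tuples of floats, the Huber threshold is assumed to be 1, and the loss is summed over all ordered pairs of distinct examples (a practical implementation might average instead). It also assumes the batch contains at least two distinct embeddings, so the mean distance is nonzero, and it needs Python 3.8+ for `math.dist`.

```python
import math
from itertools import permutations


def huber(x, y):
    """Huber loss between two scalars, with the threshold assumed to be 1."""
    d = abs(x - y)
    return 0.5 * d * d if d <= 1.0 else d - 0.5


def normalized_distances(embeddings):
    """Pairwise Euclidean distances over ordered pairs (i, j), i != j,
    divided by their mean over the batch (the normalization factor mu)."""
    pairs = list(permutations(range(len(embeddings)), 2))
    dists = [math.dist(embeddings[i], embeddings[j]) for i, j in pairs]
    mu = sum(dists) / len(dists)  # assumes not all embeddings coincide
    return [d / mu for d in dists]


def rkd_distance_loss(teacher, student):
    """Sum of Huber losses between teacher and student distance potentials."""
    t = normalized_distances(teacher)
    s = normalized_distances(student)
    return sum(huber(a, b) for a, b in zip(t, s))


# Illustrative data: the student reproduces the teacher's layout at 10x the
# scale and in a different dimensionality, so relative distances match.
teacher = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
print(rkd_distance_loss(teacher, student))  # approximately 0
```

Because each side is normalized by its own mean distance, the loss ignores the overall scale and the embedding dimension, and only penalizes differences in relative distances.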
Angle-wise distillation loss
Angle-wise relation potential measures the angle formed by the three examples in the output representation space.
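For a triplet of teacher embeddings $(t_i, t_j, t_k)$, the paper expresses this potential as the cosine of the angle at $t_j$, and defines the corresponding loss over triplets of distinct examples $X^3$ (the student side uses $s_i$, $s_j$, $s_k$ analogously):

$$
\psi_A(t_i, t_j, t_k) = \cos \angle t_i t_j t_k = \langle e^{ij}, e^{kj} \rangle,
\qquad
e^{ij} = \frac{t_i - t_j}{\lVert t_i - t_j \rVert_2},
\quad
e^{kj} = \frac{t_k - t_j}{\lVert t_k - t_j \rVert_2}
$$

$$
\mathcal{L}_{\text{RKD-A}} = \sum_{(x_i, x_j, x_k) \in X^3} l_{\delta}\big(\psi_A(t_i, t_j, t_k),\ \psi_A(s_i, s_j, s_k)\big)
$$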

The angle-wise distillation loss compares these potentials between the teacher and the student using the Huber loss $l_{\delta}$. It transfers the relationship of training example embeddings by penalizing angular differences. Since an angle is a higher-order property than a distance, it may transfer relational information more effectively, giving more flexibility to the student in training.
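The angle-wise loss can be sketched in the same pure-Python style. Again this is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the Huber threshold is assumed to be 1, the loss is summed over all ordered triplets of distinct indices, and all embeddings in a batch are assumed to be distinct so that the unit vectors are well defined.

```python
import math
from itertools import permutations


def huber(x, y):
    """Huber loss between two scalars, with the threshold assumed to be 1."""
    d = abs(x - y)
    return 0.5 * d * d if d <= 1.0 else d - 0.5


def unit_vector(u, v):
    """Unit vector pointing from v to u (assumes u != v)."""
    diff = [a - b for a, b in zip(u, v)]
    norm = math.sqrt(sum(c * c for c in diff))
    return [c / norm for c in diff]


def angle_potential(ti, tj, tk):
    """Cosine of the angle at vertex tj formed by ti and tk."""
    e_ij = unit_vector(ti, tj)
    e_kj = unit_vector(tk, tj)
    return sum(a * b for a, b in zip(e_ij, e_kj))


def rkd_angle_loss(teacher, student):
    """Sum of Huber losses between teacher and student angle potentials."""
    triplets = permutations(range(len(teacher)), 3)
    return sum(
        huber(
            angle_potential(teacher[i], teacher[j], teacher[k]),
            angle_potential(student[i], student[j], student[k]),
        )
        for i, j, k in triplets
    )


# Illustrative data: a scaled copy of the teacher layout preserves all angles.
teacher = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
print(rkd_angle_loss(teacher, student))  # approximately 0
```

Note that the number of triplets grows cubically with the batch size, which is one way to see why higher-order potentials are more expensive.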
Training with RKD
During training, multiple distillation loss functions, including the proposed RKD losses, can be used either alone or together with task-specific loss functions.
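In the paper, the overall objective takes the following generic form, where $\mathcal{L}_{\text{task}}$ is the task-specific loss, $\mathcal{L}_{\text{KD}}$ is a knowledge distillation loss, and $\lambda_{\text{KD}}$ is a tunable hyperparameter that balances the two terms. When several KD losses are used together, each is weighted by its own balancing factor:

$$
\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{KD}} \cdot \mathcal{L}_{\text{KD}}
$$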

Experiments
The authors evaluate the proposed method on metric learning, image classification, and few-shot learning. RKD-D and RKD-A denote training with the distance-wise and angle-wise losses, respectively, and RKD-DA denotes training with both combined.
Metric learning
Metric learning aims to train an embedding model that projects data examples onto a manifold where two examples are close to each other if they are semantically similar and far apart otherwise. Embedding models are commonly evaluated on image retrieval, with recall@K as the evaluation metric. Once all test images are embedded using a model, each test image is used as a query and the top K nearest-neighbor images are retrieved from the test set, excluding the query. Recall@K is computed by averaging the per-query recall over the whole test set.
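The evaluation procedure above can be sketched as follows. This is a minimal illustration, not the benchmark code: `recall_at_k` is a hypothetical helper, embeddings are plain tuples, neighbors are ranked by Euclidean distance, and a query is counted as a hit when at least one of its top K neighbors shares its label, which is the usual convention for this metric. It needs Python 3.8+ for `math.dist`.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries whose top-k neighbors (query excluded)
    contain at least one item with the same label."""
    hits = 0
    for q, (q_emb, q_label) in enumerate(zip(embeddings, labels)):
        # Rank every other test item by its distance to the query.
        others = [i for i in range(len(embeddings)) if i != q]
        others.sort(key=lambda i: math.dist(q_emb, embeddings[i]))
        if any(labels[i] == q_label for i in others[:k]):
            hits += 1
    return hits / len(embeddings)


# Illustrative toy data: the last point carries label 1 but sits next to
# the label-0 cluster, so it is only retrieved correctly for larger k.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.3, 0.0)]
labels = [0, 0, 1, 1, 1]
for k in (1, 2, 3):
    print(k, recall_at_k(embeddings, labels, k))
```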

Triplet. When an anchor $x_a$, a positive $x_p$, and a negative $x_n$ are given, the triplet loss enforces the squared Euclidean distance between the anchor and the negative to be larger than that between the anchor and the positive by a margin $m$.
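In formula form this is $[\,\lVert f(x_a) - f(x_p) \rVert_2^2 - \lVert f(x_a) - f(x_n) \rVert_2^2 + m\,]_+$, where $[\cdot]_+$ clamps negative values to zero. A small sketch on already-embedded points is shown below; the function name and the default margin of 0.2 are assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative
    squared distances; zero once the negative is far enough away."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Illustrative embeddings: the first negative is well separated, the second
# is exactly as close to the anchor as the positive.
anchor, positive = (0.0, 0.0), (1.0, 0.0)
print(triplet_loss(anchor, positive, (3.0, 0.0)))  # margin satisfied
print(triplet_loss(anchor, positive, (0.0, 1.0)))  # margin violated
```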

L2 normalization. The results show that RKD benefits from training without L2 normalization.
Self-distillation. RKD can improve smaller student models beyond their teachers, and it can also improve a model of the same architecture through self-distillation. All models consistently outperform the initial teacher models, which are trained with the triplet loss.

Comparison with SOTA. Authors compare the result of RKD with SOTA methods for metric learning.

Students excelling teachers. Prior work on distillation explains that the soft class-distribution output of the teacher may carry additional information that cannot be encoded in one-hot vectors of ground-truth labels.
RKD as a training domain adaptation. Effective adaptation to the specific characteristics of the domain may be crucial. To measure the degree of adaptation of a model trained with RKD losses, the authors compare recall@1 on the training data domain with that on different data domains.
Image Classification
The authors also validate the proposed method on image classification by comparing RKD with individual KD (IKD) methods. The overall results reveal that the proposed RKD method is complementary to other KD methods.

Few-shot learning
Finally, they validate the proposed method on few-shot learning, which aims to learn a classifier that generalizes to new, unseen classes with only a few examples for each new class.

Conclusion
They demonstrated on different tasks and benchmarks that the proposed RKD effectively transfers knowledge using mutual relations of data examples. In particular for metric learning, RKD enables smaller students to even outperform their larger teachers.
Reference
[Figure-1~10], [Table-1~3]: https://arxiv.org/pdf/1904.05068.pdf