Paper review: Rewrite the Stars (CVPR 2024)

2025. 1. 13. 01:38

Rewrite the Stars

 

Motivation

Since AlexNet, a myriad of deep networks have emerged, each building on its predecessors. Despite their distinctive insights and contributions, most of these models are built from blocks that blend linear projections with non-linear activations. Self-attention brought a different design: its most distinctive feature is mapping features to different spaces and then constructing an attention matrix through dot-product multiplication. However, this implementation is not efficient, because the attention complexity scales quadratically with the number of tokens. In contrast, element-wise multiplication (the star operation) exhibits promising performance and efficiency, but previous research has explained it only through intuition and assumptions.

 

Main Idea

The inclusion of high-dimensional and nonlinear features is crucial in both traditional machine learning algorithms and deep learning networks.

In deep learning, we typically start by linearly projecting low-dimensional features into a high-dimensional space and then introduce non-linearity with an activation function. In contrast, traditional machine learning algorithms can attain high dimensionality and non-linearity simultaneously through the kernel trick.

A polynomial kernel function $k(x_1, x_2) = (\gamma x_1 \cdot x_2 + c)^d$ can project the input features $x_1, x_2 \in \mathbb{R}^n$ into an $(n + 1)^d$-dimensional non-linear feature space; the Gaussian kernel function $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2) = \exp(-\|x_1\|^2)\exp(-\|x_2\|^2)\sum_{i=0}^{+\infty}{\frac{(2x_1^Tx_2)^i}{i!}}$ can result in an infinite-dimensional feature space through its Taylor expansion.
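The Taylor-expansion form of the Gaussian kernel can be checked numerically. The following is a minimal sketch in plain Python, not code from the paper; the helper names and the sample vectors are hypothetical and chosen only for illustration. It compares the closed form $\exp(-\|x_1 - x_2\|^2)$ with the truncated series.

```python
import math

def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gaussian_kernel(x1, x2):
    """Closed form: exp(-||x1 - x2||^2)."""
    diff = [a - b for a, b in zip(x1, x2)]
    return math.exp(-dot(diff, diff))

def gaussian_kernel_taylor(x1, x2, n_terms):
    """Truncated series: exp(-||x1||^2) exp(-||x2||^2) sum_i (2 x1.x2)^i / i!"""
    prefactor = math.exp(-dot(x1, x1)) * math.exp(-dot(x2, x2))
    series = sum((2.0 * dot(x1, x2)) ** i / math.factorial(i) for i in range(n_terms))
    return prefactor * series

# Arbitrary sample inputs (assumption, for illustration only)
x1 = [0.3, -0.5, 0.8]
x2 = [0.1, 0.4, -0.2]

exact = gaussian_kernel(x1, x2)
approx_3 = gaussian_kernel_taylor(x1, x2, 3)    # few terms: visible error
approx_20 = gaussian_kernel_taylor(x1, x2, 20)  # many terms: matches closely
print(exact, approx_3, approx_20)
```

Each extra term of the series corresponds to a higher polynomial degree in the inputs, which is why the full kernel is usually described as living in an infinite-dimensional feature space.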

By comparison, classical kernel methods and neural networks differ in how they implement and interpret high-dimensional, non-linear features. In this paper, the authors demonstrate that the star operation can obtain a high-dimensional, non-linear feature space from a low-dimensional input, akin to the principle of the kernel trick.

Star Operation in One layer

In a single layer, the star operation is typically written as $(W_1^TX + B_1) * (W_2^TX + B_2)$. For convenience, they consolidate the weight and bias into one entity, denoted by $W = \begin{bmatrix} W \\ B \end{bmatrix}$, and similarly $X = \begin{bmatrix} X \\ 1 \end{bmatrix}$, resulting in the star operation $(W_1^TX) * (W_2^TX)$. Specifically, they define $w_1, w_2, x \in \mathbb{R}^{(d+1) \times 1}$, where $d$ is the input channel number.

In general, they rewrite the star operation as:

[Eq-1] Star operation in one layer

The expansion is a composition of $\frac{(d+2)(d+1)}{2}$ distinct terms, so a single layer reaches an implicit feature space of roughly $(\frac{d}{\sqrt{2}})^2$ dimensions without incurring any additional computational overhead. Of note, this prominent property shares a similar philosophy with kernel functions.
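The term count can be checked numerically. This is a minimal sketch in plain Python, not the authors' code: it compares the direct product $(w_1^Tx)(w_2^Tx)$ with the sum over distinct monomials $x^ix^j$ ($i \le j$), where the merged coefficient `alpha` collects the two symmetric cross terms. The value `d = 6`, the random weights, and the function names are arbitrary assumptions for illustration.

```python
import random

def star_single_output(w1, w2, x):
    """Direct form: (w1^T x) * (w2^T x) for one output channel."""
    return sum(a * b for a, b in zip(w1, x)) * sum(a * b for a, b in zip(w2, x))

def star_expanded(w1, w2, x):
    """Sum over distinct monomials x^i x^j (i <= j) with merged coefficients."""
    n = len(x)
    total, n_terms = 0.0, 0
    for i in range(n):
        for j in range(i, n):
            if i == j:
                alpha = w1[i] * w2[j]
            else:
                # x^i x^j and x^j x^i are the same monomial, so merge both weights
                alpha = w1[i] * w2[j] + w1[j] * w2[i]
            total += alpha * x[i] * x[j]
            n_terms += 1
    return total, n_terms

random.seed(0)
d = 6  # input channel number (arbitrary choice)
x = [random.uniform(-1, 1) for _ in range(d)] + [1.0]  # trailing 1 absorbs the bias
w1 = [random.uniform(-1, 1) for _ in range(d + 1)]
w2 = [random.uniform(-1, 1) for _ in range(d + 1)]

direct = star_single_output(w1, w2, x)
expanded, n_terms = star_expanded(w1, w2, x)
print(direct, expanded, n_terms, (d + 2) * (d + 1) // 2)
```

Both forms give the same value, while the expanded form makes the number of implicit monomial features explicit.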

Star Operation in Multiple layers

Next, they demonstrate that stacking multiple layers increases the implicit dimensions exponentially, to nearly infinite, in a recursive manner. A single-layer star operation yields the expression $\sum_{i=1}^{d+1}\sum_{j=1}^{d+1}{w_1^iw_2^jx^ix^j}$ in an implicit feature space of $\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^1}}$.

Let $O_l$ denote the output of the $l$-th star operation.

[Eq-2] Star operation in multiple layers

That is, with $l$ layers, they implicitly obtain a feature space belonging to $\mathbb{R}^{(\frac{d}{\sqrt{2}})^{2^l}}$. Therefore, stacking even just a few layers lets star operations amplify the implicit dimensions exponentially.
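To get a feel for how fast this grows, the following sketch evaluates $(\frac{d}{\sqrt{2}})^{2^l}$ in log space. It is only an illustration of the formula above; the width of 128 and the layer counts are hypothetical values, not settings taken from the paper's experiments.

```python
import math

def log10_implicit_dim(d, num_layers):
    """log10 of (d / sqrt(2)) ** (2 ** num_layers), computed in log space to avoid overflow."""
    return (2 ** num_layers) * math.log10(d / math.sqrt(2))

width = 128  # hypothetical network width
for num_layers in (1, 2, 4, 10):
    exponent = log10_implicit_dim(width, num_layers)
    print(f"layers={num_layers:2d}  implicit dim ~ 10^{exponent:.1f}")
```

With one layer the value is $128^2 / 2 = 8192$; each additional layer squares the implicit dimension, so the exponent doubles per layer.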

Cases of Star Operation

Case 1: Non-linear nature of $W_1$ and/or $W_2$. Here the transformation functions $W_1$ and/or $W_2$ are made non-linear by incorporating activation functions. The critical aspect is that they still maintain channel communication, so the number of implicit dimensions remains approximately $\frac{d^2}{2}$.

Case 2: $W_1^TX * X$. When removing the transformation $W_2$, the implicit dimension number decreases from approximately $\frac{d^2}{2}$ to $2d$.

Case 3: $X * X$. In this case, the star operation converts the feature from the space $\{x^1, x^2, ..., x^d\} \in \mathbb{R}^d$ to a new space characterized by $\{x^1x^1, x^2x^2, ..., x^dx^d\} \in \mathbb{R}^d$.
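The three cases can be written side by side as a small sketch. This is plain Python for illustration only: the helper names are hypothetical, biases are omitted, and the use of ReLU on the first branch in Case 1 is an assumption (the cases above only say that an activation function is incorporated).

```python
def linear(W, x):
    """Apply a linear map; each row of W holds the weights of one output channel."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def star(a, b):
    """Element-wise multiplication of two equally sized vectors."""
    return [p * q for p, q in zip(a, b)]

def case1(W1, W2, x):
    # Case 1: one branch is made non-linear (ReLU is an assumption here)
    return star(relu(linear(W1, x)), linear(W2, x))

def case2(W1, x):
    # Case 2: W2 is removed, so the transformed branch multiplies the raw input
    return star(linear(W1, x), x)

def case3(x):
    # Case 3: no transformation at all, each channel is simply squared
    return star(x, x)

x = [1.0, -2.0, 3.0]
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
W2 = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, -1.0]]

print(case1(W1, W2, x))  # mixes channels through both W1 and W2
print(case2(W1, x))      # mixes channels through W1 only
print(case3(x))          # no channel mixing: [x^1 x^1, x^2 x^2, x^3 x^3]
```

Case 3 only ever produces the squared terms, which is why its output space stays at $d$ dimensions, while Cases 1 and 2 also produce cross-channel products.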

There are several notable aspects to consider. First, star operations and their special cases are commonly integrated with spatial interactions, as exemplified in VAN. Second, it is feasible to combine these special cases, as demonstrated in Conv2Former, which merges Case 1 and Case 2, and in GENet, which blends elements of Case 1 and Case 3.

 

Experiments

Empirical superiority of star operation

Initially, they empirically validate the superiority of the star operation over simple summation. For this demonstration, DemoNet is designed to be straightforward: a convolutional layer that reduces the input resolution by a factor of 16, followed by a sequence of homogeneous demo blocks for feature extraction, as shown in [Figure 1]. Within each demo block, they apply either the star operation or the summation operation to fuse features from two distinct branches.
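The star-versus-sum switch inside a block can be sketched as follows. This is a simplified stand-in, not the DemoNet architecture itself: the two branches are plain linear maps on a feature vector, the ReLU after fusion is an assumption, and the width, depth, and all names are hypothetical.

```python
import random

def linear(W, x):
    """Apply a linear map; each row of W holds the weights of one output channel."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def demo_block(x, W1, W2, mode):
    """Two parallel linear branches fused by 'star' (multiply) or 'sum' (add)."""
    a, b = linear(W1, x), linear(W2, x)
    if mode == "star":
        fused = [p * q for p, q in zip(a, b)]
    elif mode == "sum":
        fused = [p + q for p, q in zip(a, b)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return relu(fused)  # activation placement is an assumption

def demo_net(x, blocks, mode):
    """Apply a sequence of homogeneous blocks; `blocks` is a list of (W1, W2) pairs."""
    for W1, W2 in blocks:
        x = demo_block(x, W1, W2, mode)
    return x

def rand_matrix(n):
    return [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

random.seed(0)
width, depth = 4, 3  # hypothetical sizes
blocks = [(rand_matrix(width), rand_matrix(width)) for _ in range(depth)]
x = [0.5, -1.0, 2.0, 0.25]

out_star = demo_net(x, blocks, "star")
out_sum = demo_net(x, blocks, "sum")
print(out_star, out_sum)
```

Keeping everything identical except the fusion operator mirrors the controlled comparison described above: the same weights and depth are used, and only `mode` changes.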

[Figure 1, 2], [Table 1, 2] Comparison of the two operations across width and depth

They observe that the star operation consistently outperforms the sum operation. Moreover, as the network width increases, the performance gains brought by the star operation gradually diminish. However, no similar phenomenon appears when varying the depth. This disparity suggests two key insights: 1) the gradual decrease in the gains brought by the star operation is not a consequence of the model's enlarged size; 2) consequently, the star operation does intrinsically expand the network's dimensionality, which in turn lessens the incremental benefit of widening the network.

Decision boundary comparison

Subsequently, they visually analyze the differences between the star and summation operations using the 2D moon dataset.

[Figure 3,4] Decision boundary comparison

Notably, the observed differences in decision boundaries do not stem from non-linearity, as both operations incorporate activation functions in their respective building blocks. The primary distinction arises from the star operation's capability to attain exceedingly high dimensionality, as analyzed above. Furthermore, the decision boundary produced by the star operation closely mirrors that of the polynomial kernel.

Extension to networks without activations

Activation functions are fundamental and indispensable components in neural networks. Without activation functions, traditional neural networks would collapse into a single-layer network due to the lack of non-linearity. While the paper's primary focus is on the implicit high-dimensional features achieved via star operations, the aspect of non-linearity also holds profound importance. To investigate it, they remove all activations from DemoNet in this experiment.
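The collapse argument, and why the star operation escapes it, can be illustrated with a tiny numerical sketch. This is plain Python with hypothetical 2x2 weights, not the paper's experiment: stacked linear layers without activation equal one merged linear layer, whereas a star layer without activation is quadratic in its input.

```python
def linear(W, x):
    """Apply a linear map; each row of W holds the weights of one output channel."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A @ B for matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def star_layer(W1, W2, x):
    """Activation-free star operation: (W1 x) * (W2 x), element-wise."""
    return [p * q for p, q in zip(linear(W1, x), linear(W2, x))]

# Hypothetical weights and input
W_a = [[1.0, 2.0], [0.0, -1.0]]
W_b = [[0.5, 0.0], [1.0, 1.0]]
x = [3.0, -2.0]

# Two stacked linear layers without activation equal one merged linear layer.
stacked = linear(W_b, linear(W_a, x))
collapsed = linear(matmul(W_b, W_a), x)

# A star layer without activation is not linear: doubling x quadruples the output.
y = star_layer(W_a, W_b, x)
y_doubled = star_layer(W_a, W_b, [2.0 * v for v in x])

print(stacked, collapsed)
print(y, y_doubled)
```

A linear map would give `y_doubled == 2 * y`; the factor of 4 shows that the star operation supplies non-linearity on its own, even with all activations removed.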

[Figure 5], [Table 3] Comparison of the two operations with and without activation functions

Comparison of Efficient models on ImageNet-1k (StarNet)

[Figure 6,7], [Table 4, 5] Comparison of efficient models on ImageNet-1k

Substituting the star operation

[Table 6] Gradually replacing star operation with summation

Latency impact of the star operation

[Table 7] Latency comparison of different operations in StarNet

Study on the activation placement

[Table 8] Results of diverse activation placement in StarNet-S4

Conclusion

In this paper, they delve into the intricate details of the star operation, going beyond the intuitive and plausible explanations of previous research. The star operation derives strong representational capacity from implicitly high-dimensional spaces. In many ways, it mirrors the behavior of polynomial kernel functions. Their analysis is validated through empirical, theoretical, and visual methods. Based on this analysis, StarNet shows impressive performance.

 

Reference

[Figure 1-7], [Table 1-8]: https://arxiv.org/pdf/2403.19967
