🤖 AI Summary
To address the high computational cost and heavy data dependency of attention in Vision Transformers (ViTs), this paper proposes NiNformer, an architecture that replaces the standard attention layers of the ViT block with a Network-in-Network (NiN) structure. An inner token-mixing network generates a learnable, element-wise gating function, so the feature transformation adapts to the input rather than relying on the static token mixing of the MLP-Mixer. The design reduces computational requirements relative to attention, and the reported experiments show that it outperforms the baseline architectures on multiple image classification datasets.
📝 Abstract
The attention mechanism is the primary component of the transformer architecture; it has led to significant advancements in deep learning across many domains and tasks. In computer vision, the attention mechanism was first incorporated in the Vision Transformer (ViT), and its use has since expanded to many vision tasks, such as classification, segmentation, object detection, and image generation. While the attention mechanism is very expressive and capable, it is computationally expensive and requires datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. Examples in the vision domain include the MLP-Mixer, the Conv-Mixer, and the Perceiver-IO, each with its own set of advantages and disadvantages. This paper introduces a new computational block as an alternative to the standard ViT block. The proposed block reduces the computational requirements by replacing the normal attention layers with a Network in Network structure, thereby enhancing the static approach of the MLP-Mixer with a dynamically learned element-wise gating function generated by a token-mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets in the image classification task of the vision domain.
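The gating idea described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the sigmoid gate, the GELU activation, the residual connection, and the omission of normalization are all assumptions made for illustration, and a real model would use a tensor library rather than nested lists. An inner MLP mixes information across the token axis to produce a gate, which then scales a per-token linear projection element by element.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply(fn, m):
    """Apply a scalar function to every entry of a matrix."""
    return [[fn(v) for v in row] for row in m]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gelu(v):
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def gated_token_mixing_block(x, w_tok1, w_tok2, w_chan):
    """Hypothetical gated block; x has shape (tokens, channels).

    w_tok1: (hidden, tokens), w_tok2: (tokens, hidden), w_chan: (channels, channels).
    Returns a matrix with the same shape as x.
    """
    # Inner network (assumed form): a two-layer MLP over the token axis.
    hidden = apply(gelu, matmul(w_tok1, x))          # (hidden, channels)
    gate = apply(sigmoid, matmul(w_tok2, hidden))    # (tokens, channels), values in (0, 1)
    # Outer path: linear projection over channels, applied to each token.
    proj = matmul(x, w_chan)                         # (tokens, channels)
    # Element-wise gating, followed by an assumed residual connection.
    return [
        [xi + g * p for xi, g, p in zip(x_row, g_row, p_row)]
        for x_row, g_row, p_row in zip(x, gate, proj)
    ]


# Small demo: 4 tokens, 3 channels, token-mixing hidden width 5.
rng = random.Random(0)
tokens, channels, hidden_width = 4, 3, 5
x = random_matrix(tokens, channels, rng, scale=1.0)
out = gated_token_mixing_block(
    x,
    random_matrix(hidden_width, tokens, rng),
    random_matrix(tokens, hidden_width, rng),
    random_matrix(channels, channels, rng),
)
print(len(out), len(out[0]))
```

Because the gate is computed from the input itself, the scaling applied to each feature changes from image to image. In this sketch the token-mixing step costs on the order of tokens × hidden × channels multiply-adds, whereas the attention score matrix alone costs on the order of tokens² × channels, which is the kind of saving the abstract refers to.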