ConvShareViT: Enhancing Vision Transformers with Convolutional Attention Mechanisms for Free-Space Optical Accelerators

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the incompatibility of Vision Transformers (ViTs) with 4f free-space optical systems, this paper proposes ConvShareViT: a novel architecture that replaces the linear layers in both the multi-head self-attention (MHSA) and MLP blocks with channel-shared depthwise convolutions using valid padding. This work is the first to systematically demonstrate that shared-weight depthwise convolutions can effectively model attention mechanisms, and it further reveals the critical impact of the padding strategy on attention learning. The resulting architecture is fully matrix-multiplication-free and optically compatible: a purely convolutional ViT. Through 4f-system-aware mapping optimization and structural reparameterization, ConvShareViT achieves attention quality comparable to standard ViTs while delivering a theoretical 3.04× inference speedup over GPU-based systems. Empirical evaluation validates both the feasibility and high performance of purely convolutional ViTs on photonic computing hardware.

📝 Abstract
This paper introduces ConvShareViT, a novel deep learning architecture that adapts Vision Transformers (ViTs) to the 4f free-space optical system. ConvShareViT replaces the linear layers in multi-head self-attention (MHSA) and Multilayer Perceptrons (MLPs) with a depthwise convolutional layer whose weights are shared across input channels. Through the development of ConvShareViT, the behaviour of convolutions within MHSA and their effectiveness in learning the attention mechanism were analysed systematically. Experimental results demonstrate that certain configurations, particularly those using valid-padded shared convolutions, can successfully learn attention, achieving attention scores comparable to those of standard ViTs. Other configurations, such as those using same-padded convolutions, show limitations in attention learning and behave like regular CNNs rather than transformer models. ConvShareViT architectures are specifically optimised for the 4f optical system, exploiting the parallelism and high-resolution capabilities of optical hardware. Results demonstrate that ConvShareViT can theoretically achieve up to 3.04 times faster inference than GPU-based systems. This potential acceleration makes ConvShareViT an attractive candidate for future optical deep learning applications and shows that ConvShareViT can be implemented using only the convolution operation, given the necessary optimisation of the ViT to balance performance and complexity.
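The core substitution described above rests on a simple identity: a fully connected projection y = Wx can be expressed as a set of "valid" convolutions whose kernels span the entire input vector, so each kernel slides exactly once and emits a single dot product; sharing the same kernels across input channels then applies one projection to every token, as the Q/K/V linear layers of MHSA do. The sketch below (hypothetical names, not the paper's code) demonstrates this equivalence in plain NumPy:

```python
import numpy as np

# Illustrative sketch: emulate a linear projection with valid-padded,
# weight-shared 1-D convolutions. Dimensions and variable names are
# assumptions for illustration, not taken from the paper.

rng = np.random.default_rng(0)
d = 8            # embedding dimension
n_tokens = 4     # all tokens share the same projection weights

W = rng.standard_normal((d, d))         # d kernels, each of length d
X = rng.standard_normal((n_tokens, d))  # one row per token

# Standard per-token linear projection
Y_linear = X @ W.T

def valid_conv(x, kernel):
    """1-D cross-correlation with no padding ('valid')."""
    L, K = len(x), len(kernel)
    return np.array([x[i:i + K] @ kernel for i in range(L - K + 1)])

# Kernel length equals the input length, so each convolution yields a
# single scalar; the same kernels are reused for every token (channel).
Y_conv = np.stack([
    np.concatenate([valid_conv(x, w) for w in W])
    for x in X
])

assert np.allclose(Y_linear, Y_conv)  # conv reproduces the linear layer
```

This also hints at why padding matters in the paper's findings: with same padding the kernel no longer spans the full vector in a single step, so the operation mixes neighbouring positions like an ordinary CNN instead of computing the global dot products that attention projections require.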
Problem

Research questions and friction points this paper is trying to address.

Adapts Vision Transformers for optical accelerators using convolutional attention
Analyzes convolution effectiveness in learning attention mechanisms systematically
Optimizes architecture for 4f optical system to enhance inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces linear layers with shared depthwise convolutions
Optimized for 4f free-space optical systems
Achieves faster inference than GPU-based systems