I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

📅 2025-09-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation of Vision Transformer (ViT)-based semantic segmentation models on resource-constrained devices under low-bit quantization—caused by accumulated quantization errors—this paper proposes the first end-to-end, fully integer-only quantized ViT segmentation framework. The method introduces three key innovations: (1) a λ-ShiftGELU activation that mitigates the quantization distortion induced by long-tailed activation distributions; (2) removal of the L2 normalization layer and replacement of the decoder's bilinear interpolation with nearest-neighbor upsampling, ensuring integer-only arithmetic throughout the computational graph; and (3) integration of post-training quantization (PTQ) with structural reparameterization for floating-point-free inference. Evaluated on the Segmenter architecture, the approach incurs an average accuracy drop of only 5.1% relative to the FP32 baseline while compressing model size by up to 3.8× and accelerating inference by up to 1.2×. Notably, it attains practical accuracy even with single-image calibration, significantly enhancing robustness and efficiency for ultra-low-bit deployment.
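The long-tail problem the summary mentions can be seen with a minimal sketch of symmetric per-tensor uniform quantization (this is illustrative NumPy, not the paper's code): a single outlier sets the scale for the whole tensor, so most near-zero activations collapse onto the same integer level. This is the distortion that λ-ShiftGELU is designed to mitigate.

```python
import numpy as np

def uniform_quantize(x, num_bits=8):
    """Symmetric uniform quantization: map floats to signed integers.

    The scale is fixed by the largest magnitude in x, so one long-tail
    outlier stretches the quantization grid and destroys resolution for
    the bulk of small activations.
    """
    qmax = 2 ** (num_bits - 1) - 1               # e.g. 127 for int8
    scale = np.abs(x).max() / qmax               # one scale per tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# GELU-like activations cluster near zero but carry a long tail:
rng = np.random.default_rng(0)
acts = np.concatenate([rng.normal(0.0, 0.05, 1000),  # bulk of values
                       np.array([8.0])])             # one tail outlier
q, s = uniform_quantize(acts)
zero_fraction = (q[:1000] == 0).mean()  # large: small values lose resolution
```

With the outlier present, the scale becomes 8.0/127 ≈ 0.063, so every activation smaller than about 0.03 in magnitude rounds to the integer 0.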

📝 Abstract
Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest-neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8× and enabling up to 1.2× faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
Problem

Research questions and friction points this paper is trying to address.

Enabling efficient integer-only semantic segmentation with Vision Transformers
Reducing computational cost and memory footprint for resource-constrained devices
Mitigating quantization errors in deep encoder-decoder transformer pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integer-only Vision Transformer architecture
Novel activation function for quantization stability
Computational graph modifications for integer execution
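One of the graph modifications above, swapping the decoder's bilinear interpolation for nearest-neighbor upsampling, can be sketched with integer arithmetic alone (a hypothetical illustration, not the paper's implementation): bilinear interpolation needs fractional blending weights, whereas nearest-neighbor only repeats source pixels via integer division.

```python
import numpy as np

def nearest_neighbor_upsample_int(x, factor):
    """Upsample a 2-D integer feature map using only integer index math.

    Each output pixel (i, j) copies the source pixel (i // factor,
    j // factor), so no fractional weights or float multiplies are
    needed -- the whole operation stays inside the integer domain.
    """
    h, w = x.shape
    out = np.empty((h * factor, w * factor), dtype=x.dtype)
    for i in range(h * factor):
        for j in range(w * factor):
            out[i, j] = x[i // factor, j // factor]  # integer division only
    return out

# Usage: every source pixel expands into a factor x factor block.
x = np.arange(4, dtype=np.int8).reshape(2, 2)
up = nearest_neighbor_upsample_int(x, 2)
```

The output dtype matches the input, so quantized int8 activations pass through the decoder without ever being promoted to floating point.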
Authors
Jordan Sassoon (Université Paris-Saclay, CEA, List, F-91120 Palaiseau, France)
Michal Szczepanski (Université Paris-Saclay, CEA, List, F-91120 Palaiseau, France)
Martyna Poreba (Researcher, Computer Vision)