QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

📅 2026-05-29

🤖 AI Summary

This work addresses the challenge of deploying the 1.2-billion-parameter VGGT model on resource-constrained devices such as drones and mobile AR systems by proposing an efficient post-training quantization method tailored for geometric perception tasks. After analyzing how sensitivity to quantization varies across Transformer blocks, the method applies a selective mixed-precision allocation strategy that gives the most fragile blocks higher precision. It introduces a PCA-based global compensation token to recover geometric information lost during token filtering, and it combines multi-head supervision with cross-head geometric consistency constraints to form a task-aware quantization scale search mechanism. The approach achieves near-lossless quantization under the W4A16 setting (4-bit weights, 16-bit activations), preserving accuracy across all prediction heads on multiple 3D geometric perception benchmarks while reducing memory footprint by 3–4.9× and accelerating on-device inference by up to 2.8× over FP32.
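The selective mixed-precision idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: each "block" is reduced to a single weight matrix, sensitivity is taken as the relative output error when that matrix alone is quantized to 4 bits, and the helper names (`quantize_weight`, `block_sensitivity`, `allocate_bits`, `compression_ratio`), the 8-bit fallback, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
import numpy as np

def quantize_weight(w, bits):
    """Symmetric per-output-channel uniform quantization (returns dequantized weights)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def block_sensitivity(weights, x, bits=4):
    """Relative output error of each toy block when it alone is quantized."""
    scores = []
    for w in weights:
        ref = x @ w.T
        err = x @ quantize_weight(w, bits).T - ref
        scores.append(float(np.linalg.norm(err) / np.linalg.norm(ref)))
    return scores

def allocate_bits(scores, low=4, high=8, keep_ratio=0.25):
    """Give the most sensitive fraction of blocks `high` bits, the rest `low` bits."""
    n_high = max(1, int(round(len(scores) * keep_ratio)))
    fragile = set(np.argsort(scores)[-n_high:].tolist())
    return [high if i in fragile else low for i in range(len(scores))]

def compression_ratio(weights, bits, ref_bits=32):
    """Weight-only memory ratio against a `ref_bits` baseline."""
    total = sum(w.size * ref_bits for w in weights)
    quant = sum(w.size * b for w, b in zip(weights, bits))
    return total / quant

# Toy data: four random blocks, one given an artificial outlier column
# (a hypothetical stand-in for a block that quantizes poorly).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(64, 64)) for _ in range(4)]
weights[2][:, 0] = 20.0
x = rng.normal(size=(128, 64))

scores = block_sensitivity(weights, x)
bits = allocate_bits(scores)
print("sensitivity:", [round(s, 3) for s in scores])
print("bit widths: ", bits)
print("weight compression vs FP32:", round(compression_ratio(weights, bits), 2))
```

The ratio printed here counts only the toy weight matrices against an FP32 reference, so it is not meant to reproduce the reported 3–4.9× figure.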

📝 Abstract

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3–4.9× memory reduction and up to 2.8× real-hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
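A rough sketch of what a scale search scored by more than layer reconstruction might look like is given below. It is a simplified illustration under stated assumptions, not QVGGT's procedure: the quantization scale is searched as a clipping ratio over a small grid, the prediction heads are replaced by two hypothetical random linear maps, and the cross-head geometric consistency term is not modelled because it is trivial in a purely linear toy. The names `fake_quant`, `search_clip`, and the weight `lam` are invented for this example.

```python
import numpy as np

def fake_quant(w, bits, clip):
    """Symmetric per-row quantization with the range shrunk by `clip` in (0, 1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip * np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def search_clip(w, x, heads, bits=4, lam=1.0, grid=None):
    """Pick the clip ratio that minimises layer reconstruction error
    plus `lam` times the summed error at the (toy) head outputs."""
    if grid is None:
        grid = np.linspace(0.5, 1.0, 11)
    ref = x @ w.T                          # full-precision layer output
    ref_heads = [ref @ h.T for h in heads]  # full-precision head outputs
    best_clip, best_loss = None, np.inf
    for clip in grid:
        out = x @ fake_quant(w, bits, clip).T
        recon = np.mean((out - ref) ** 2)
        task = sum(np.mean((out @ h.T - r) ** 2)
                   for h, r in zip(heads, ref_heads))
        loss = recon + lam * task
        if loss < best_loss:
            best_clip, best_loss = float(clip), float(loss)
    return best_clip, best_loss

# Toy data: one layer, calibration activations, and two stand-in heads
# (1 output for a depth-like head, 3 outputs for a point-like head).
rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64))
x = rng.normal(size=(256, 64))
heads = [rng.normal(size=(1, 32)), rng.normal(size=(3, 32))]

clip, loss = search_clip(w, x, heads)
print("chosen clip ratio:", clip, "combined loss:", loss)
```

Setting `lam=0` reduces the criterion to plain layer reconstruction, which makes it easy to compare the two choices on the same calibration data.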

Problem

Research questions and friction points this paper is trying to address.

3D perception

model quantization

resource-constrained deployment

Visual Geometry Grounded Transformer

edge computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-training quantization

mixed-precision transformer

token filtering