Geometrically Constrained and Token-Based Probabilistic Spatial Transformers

📅 2025-09-14
🤖 AI Summary
Fine-grained visual classification (FGVC) is highly sensitive to geometric deformations such as rotation, scaling, shearing, and perspective distortion, while existing equivariant architectures incur high computational overhead and restrict the hypothesis space. To address this, we propose a Component-Decoupled Spatial Transformer (CD-ST), which embeds a spatial transformer network (STN) into a Transformer backbone and independently regresses the affine transformation components (rotation, scaling, and shearing) to enhance pose robustness. We introduce a component-wise Gaussian variational posterior to model uncertainty in transformation estimation, and design geometric consistency and component-alignment losses to obtain a learnable, interpretable sampling normalization. The localization network shares an encoder with the main branch and uses data-augmentation parameters to guide spatial alignment. On moth fine-grained classification benchmarks, CD-ST consistently improves robustness over other STN-based methods under diverse deformations.

📝 Abstract
Fine-grained visual classification (FGVC) remains highly sensitive to geometric variability, where objects appear under arbitrary orientations, scales, and perspective distortions. While equivariant architectures address this issue, they typically require substantial computational resources and restrict the hypothesis space. We revisit Spatial Transformer Networks (STNs) as a canonicalization tool for transformer-based vision pipelines, emphasizing their flexibility, backbone-agnostic nature, and lack of architectural constraints. We propose a probabilistic, component-wise extension that improves robustness. Specifically, we decompose affine transformations into rotation, scaling, and shearing, and regress each component under geometric constraints using a shared localization encoder. To capture uncertainty, we model each component with a Gaussian variational posterior and perform sampling-based canonicalization during inference. A novel component-wise alignment loss leverages augmentation parameters to guide spatial alignment. Experiments on challenging moth classification benchmarks demonstrate that our method consistently improves robustness compared to other STNs.
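
To make the component-wise decomposition concrete, here is a minimal sketch of how separately regressed rotation, scaling, and shearing values could be bounded and composed into one 2x2 affine matrix. The abstract does not state the composition order, the parameterization, or the form of the geometric constraints, so the order (rotation, then shear, then scale), the tanh bounds, and every function name and limit below are assumptions made for illustration, not the authors' implementation.

```python
import math


def bounded(raw, limit):
    """Squash an unconstrained value into (-limit, limit) using tanh."""
    return limit * math.tanh(raw)


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def components_from_raw(raw_theta, raw_log_scale, raw_shear):
    """Map raw regressor outputs to constrained components.

    The limits (pi, a factor of 2, 0.5) are illustrative assumptions.
    """
    theta = bounded(raw_theta, math.pi)                       # rotation angle
    scale = math.exp(bounded(raw_log_scale, math.log(2.0)))   # in (0.5, 2)
    shear = bounded(raw_shear, 0.5)                           # shear factor
    return theta, scale, shear


def compose_affine(theta, sx, sy, shear):
    """Compose A = R(theta) @ Sh(shear) @ S(sx, sy); the order is assumed."""
    rot = [[math.cos(theta), -math.sin(theta)],
           [math.sin(theta), math.cos(theta)]]
    shr = [[1.0, shear], [0.0, 1.0]]
    scl = [[sx, 0.0], [0.0, sy]]
    return matmul2(rot, matmul2(shr, scl))
```

With this parameterization each component stays interpretable on its own: the determinant of the composed matrix equals `sx * sy`, so rotation and shear never change area, and zero raw outputs give the identity transform.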

Problem

Research questions and friction points this paper is trying to address.

Addressing geometric variability in fine-grained visual classification
Probing probabilistic spatial transformers with geometric constraints
Enhancing robustness in object classification under distortions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic component-wise affine transformation decomposition
Gaussian variational posterior modeling uncertainty
Component-wise alignment loss using augmentation parameters
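
The three contributions listed above can be sketched together in a few lines. The example below draws each component from a Gaussian posterior with the reparameterization trick, averages class probabilities over several sampled canonicalizations, and scores a prediction against known augmentation parameters. The source gives no formulas, so everything here is an assumption for illustration: the squared-error form of the loss, the simplification that each predicted component should independently cancel the applied augmentation, the dictionary layout, and the hypothetical `classify` callable standing in for the warp-plus-backbone pipeline.

```python
import math
import random


def sample_component(mu, log_sigma, rng):
    """Reparameterized Gaussian draw: mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + math.exp(log_sigma) * rng.gauss(0.0, 1.0)


def sample_components(posterior, rng):
    """Draw one value per component from {name: (mu, log_sigma)}."""
    return {name: sample_component(mu, log_sigma, rng)
            for name, (mu, log_sigma) in posterior.items()}


def mc_predict(classify, posterior, num_samples, rng):
    """Average class probabilities over sampled canonicalizations.

    `classify` is a hypothetical callable mapping sampled components to a
    list of class probabilities.
    """
    totals = None
    for _ in range(num_samples):
        probs = classify(sample_components(posterior, rng))
        if totals is None:
            totals = list(probs)
        else:
            totals = [t + p for t, p in zip(totals, probs)]
    return [t / num_samples for t in totals]


def alignment_loss(predicted, augmentation):
    """Sum of squared residuals, assuming each predicted component should
    undo the augmentation applied to the same component (prediction equals
    the negated augmentation value)."""
    return sum((predicted[name] + augmentation[name]) ** 2
               for name in augmentation)
```

A prediction of `{"rotation": -0.3}` against an augmentation of `{"rotation": 0.3}` gives zero loss under this assumed form. Treating components independently ignores the fact that inverting a composed transform does not in general negate each factor, so this is a simplification rather than an exact inverse.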