🤖 AI Summary
Current vision-language models lack spatial invariance and equivariance under geometric transformations such as rotation and scaling, which makes object identity recognition fragile. This work presents the first systematic evaluation of how robust mainstream vision-language models, spanning diverse architectures and scales, are to geometric transformations across varied visual domains (symbolic sketches, natural images, and abstract art), using a multi-prompt testing strategy. The study reveals a significant performance drop when semantic content is sparse, exposing a fundamental gap between semantic understanding and spatial reasoning. It further quantifies the vulnerability that stems from the absence of geometric priors, underscoring the importance of incorporating geometric foundations to improve the robustness of vision-language models.
📝 Abstract
This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: they lack the robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scalings, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior holds across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
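To make the evaluation protocol concrete, the sketch below shows one way such an invariance test could be set up: generate geometric variants of an image (identity, rotations, scaling) and measure how often a model's answer on a transformed variant agrees with its answer on the original. This is an illustrative reconstruction, not the paper's actual harness; the function names (`geometric_variants`, `consistency`) and the choice of transformations are assumptions for demonstration.

```python
import numpy as np


def geometric_variants(img: np.ndarray) -> dict:
    """Produce simple geometric variants of an image array (H, W) or (H, W, C).

    These mirror the transformation types discussed above: identity,
    rotations, and scaling (2x nearest-neighbor upsampling).
    """
    return {
        "identity": img,
        "rot90": np.rot90(img, k=1),
        "rot180": np.rot90(img, k=2),
        "scale2x": np.repeat(np.repeat(img, 2, axis=0), 2, axis=1),
    }


def consistency(answers: dict) -> float:
    """Fraction of variants whose answer matches the identity answer.

    `answers` maps a variant name to the model's (string) response for
    that variant; a fully invariant model scores 1.0.
    """
    ref = answers["identity"]
    return sum(a == ref for a in answers.values()) / len(answers)
```

In practice, each variant would be sent to the VLM under several prompts (the multi-prompt strategy mentioned above), and the consistency score would be averaged over prompts and images per domain.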