🤖 AI Summary
Current vision-language models (VLMs) struggle to accurately identify fundamental geometric features, both primitives such as points and lines and relations such as orthogonality, and they generalize poorly across diverse diagram styles. To address these limitations, we propose GeoDANO, a geometry-aware vision-language model. Our approach comprises four key components: (1) introducing the first dedicated benchmark for geometric feature recognition; (2) designing GeoCLIP, a specialized vision encoder inspired by CLIP but augmented with geometric priors and trained on synthetically generated geometry-aware image–text pairs; (3) incorporating a domain-adaptation module to enhance robustness to unseen diagram styles; and (4) establishing a multi-stage geometric reasoning framework. Experiments demonstrate that GeoCLIP significantly outperforms generic encoders (e.g., OpenCLIP) on geometric feature recognition. Moreover, GeoDANO achieves state-of-the-art performance on the MathVerse benchmark, surpassing both prior domain-specific methods and GPT-4o, which indicates gains in both geometric understanding and cross-domain generalization.
📝 Abstract
We introduce GeoDANO, a geometric vision-language model (VLM) with a domain-agnostic vision encoder, for solving plane geometry problems. Although VLMs have been employed for solving geometry problems, their ability to recognize geometric features remains insufficiently analyzed. To address this gap, we propose a benchmark that evaluates the recognition of visual geometric features, including primitives such as dots and lines, and relations such as orthogonality. Our preliminary study shows that vision encoders often used in general-purpose VLMs, e.g., OpenCLIP, fail to detect these features and struggle to generalize across domains. To overcome this limitation, we develop GeoCLIP, a CLIP-based model trained on synthetic geometric diagram-caption pairs. Benchmark results show that GeoCLIP outperforms existing vision encoders in recognizing geometric features. We then propose our VLM, GeoDANO, which augments GeoCLIP with a domain adaptation strategy for unseen diagram styles. On MathVerse, GeoDANO outperforms both specialized methods for plane geometry problems and GPT-4o.