GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder

📅 2025-02-17
🤖 AI Summary
Current vision-language models (VLMs) struggle to recognize basic geometric features, both primitives such as points and lines and relations such as orthogonality, and they generalize poorly across diagram styles. To address these limitations, we propose GeoDANO, a geometry-aware vision-language model for plane geometry problems. Our approach comprises four components: (1) a benchmark for geometric feature recognition; (2) GeoCLIP, a CLIP-based vision encoder trained on synthetically generated geometric image-text pairs; (3) a domain-adaptation strategy that improves robustness to unseen diagram styles; and (4) a multi-stage geometric reasoning framework. Experiments show that GeoCLIP outperforms generic encoders (e.g., OpenCLIP) on geometric feature recognition, and that GeoDANO outperforms both prior domain-specific methods and GPT-4o on the MathVerse benchmark.

📝 Abstract
We introduce GeoDANO, a geometric vision-language model (VLM) with a domain-agnostic vision encoder, for solving plane geometry problems. Although VLMs have been employed for solving geometry problems, their ability to recognize geometric features remains insufficiently analyzed. To address this gap, we propose a benchmark that evaluates the recognition of visual geometric features, including primitives such as dots and lines, and relations such as orthogonality. Our preliminary study shows that vision encoders commonly used in general-purpose VLMs, e.g., OpenCLIP, fail to detect these features and struggle to generalize across domains. To overcome this limitation, we develop GeoCLIP, a CLIP-based model trained on synthetic geometric diagram-caption pairs. Benchmark results show that GeoCLIP outperforms existing vision encoders in recognizing geometric features. We then propose our VLM, GeoDANO, which augments GeoCLIP with a domain adaptation strategy for unseen diagram styles. GeoDANO outperforms specialized methods for plane geometry problems and GPT-4o on MathVerse.
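
As a rough illustration of what "CLIP-based" training on diagram-caption pairs typically involves, the sketch below implements the generic symmetric contrastive objective used by CLIP-style models, in plain Python. It is not taken from the paper: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are assumptions made for illustration, and GeoCLIP's actual training recipe may differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index (log-sum-exp form)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k (diagram k, caption k) is the positive; every other
    diagram-caption combination in the batch acts as a negative.
    The temperature value is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: two diagram embeddings with matching and swapped caption embeddings.
diagrams = [[1.0, 0.0], [0.0, 1.0]]
matching_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_contrastive_loss(diagrams, matching_captions)
swapped_loss = clip_contrastive_loss(diagrams, swapped_captions)
```

In this toy batch the loss is close to zero when each diagram embedding lines up with its own caption and large when the captions are swapped, which is the signal that pulls paired diagram and caption representations together during training.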
Problem

Research questions and friction points this paper is trying to address.

Improving geometric feature recognition in VLMs
Developing a domain-agnostic vision encoder
Enhancing plane geometry problem solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric VLM
Domain-agnostic vision encoder
Synthetic geometric training data

Authors

Seunghyuk Cho
POSTECH
Generative model, Hyperbolic space, VAE, Crowdsourcing

Zhenyue Qin
Independent Researcher

Yang Liu
Independent Researcher

Youngbin Choi
Pohang University of Science and Technology
Machine learning

Seungbeom Lee
Graduate School of Artificial Intelligence, POSTECH

Dongwoo Kim
Department of Computer Science and Engineering, POSTECH