🤖 AI Summary
Geometric Problem Solving (GPS) requires models to jointly reason over text and diagrams while performing dynamic visual operations—such as constructing auxiliary lines or applying affine transformations—yet current multimodal large language models (MLLMs) treat diagrams as static images and lack interactive, executable visual action capabilities. This work introduces GeoSketch, the first neuro-symbolic framework for GPS, establishing a closed-loop pipeline of “perception → symbolic reasoning → differentiable drawing actions.” It formally models auxiliary line construction and affine transformations as executable, verifiable visual operations. The method employs a two-stage training strategy: symbolic-trajectory-supervised fine-tuning followed by symbolic-reward-guided reinforcement learning. Evaluated on the newly constructed GeoSketch benchmark, GeoSketch significantly outperforms state-of-the-art MLLMs, achieving substantial gains in both stepwise reasoning accuracy and final solution success rate—empirically demonstrating the critical role of dynamic visual operations in geometric reasoning.
📝 Abstract
Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. Existing approaches process diagrams as static images and lack the capacity for dynamic manipulation, a core aspect of human geometric reasoning that involves auxiliary line construction and affine transformations. We present GeoSketch, a neural-symbolic framework that recasts geometric reasoning as an interactive perception-reasoning-action loop. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolically curated trajectories, followed by reinforcement learning with dense symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems that require auxiliary construction or affine transformations. Experiments against strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. By unifying hierarchical decision-making, executable visual actions, and symbolic verification, GeoSketch advances multimodal reasoning from static interpretation to dynamic, verifiable interaction, establishing a new foundation for solving complex visuospatial problems.
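The closed loop described in the abstract (perceive the diagram as logic forms, reason symbolically about the next step, then execute a sketch action that updates the diagram) can be sketched in miniature as below. All class and function names, the string-based logic forms, and the toy heuristic are illustrative assumptions, not the paper's actual interfaces; a real system would query a theorem base rather than the stand-in rules used here.

```python
# Hypothetical sketch of a perception -> symbolic reasoning -> sketch action loop.
# Names and representations are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class DiagramState:
    """Mutable diagram: named points, segments, and derived facts."""
    points: set
    segments: set = field(default_factory=set)
    facts: list = field(default_factory=list)

def perceive(state: DiagramState) -> list:
    """Perception module: abstract the diagram into structured logic forms
    (here, plain strings such as 'segment(A,B)')."""
    return [f"segment({a},{b})" for (a, b) in sorted(state.segments)] + list(state.facts)

def reason(logic_forms: list, goal: str):
    """Symbolic reasoning module: decide the next step.
    A toy rule: stop if the goal is already derivable, otherwise
    propose an auxiliary-line sketch action."""
    if goal in logic_forms:
        return ("done", None)
    return ("sketch", ("draw_auxiliary", "A", "M"))

def act(state: DiagramState, action) -> DiagramState:
    """Sketch action module: execute a drawing operation, updating the diagram."""
    op, a, b = action
    if op == "draw_auxiliary":
        state.segments.add((a, b))
        state.facts.append(f"auxiliary({a},{b})")
    return state

def solve(state: DiagramState, goal: str, max_steps: int = 5) -> list:
    """Closed loop: perceive, reason, act until the goal fact is derived."""
    for _ in range(max_steps):
        logic = perceive(state)
        kind, action = reason(logic, goal)
        if kind == "done":
            return logic
        state = act(state, action)
        # Stand-in for a symbolic deduction step enabled by the new auxiliary line.
        state.facts.append(goal)
    return perceive(state)
```

The point of the sketch is the control flow, not the geometry: each sketch action mutates the diagram state, so the next perception pass sees the auxiliary construction and reasoning can proceed from the updated logic forms.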