GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

📅 2025-08-08
🤖 AI Summary
Existing geometric problem-solving (GPS) benchmarks overlook auxiliary-line construction and fine-grained process evaluation, hindering rigorous assessment of multimodal large language models’ (MLLMs) long-step reasoning capabilities. Method: We introduce GeoLaux—the first benchmark explicitly designed for auxiliary-line-dependent multi-step geometric reasoning—comprising 2,186 problems with an average of 6.51 reasoning steps, 41.8% requiring auxiliary lines, and covering both computational and proof-based tasks. We propose a novel five-dimensional evaluation framework quantifying answer correctness, reasoning process quality, auxiliary-line plausibility, impact magnitude, and error attribution, grounded in human annotations and model outputs to enable interpretable assessment of reasoning paths and auxiliary-line construction. Results: Experiments across 13 state-of-the-art MLLMs reveal that nine suffer >50% performance degradation in long-step reasoning and exhibit pervasive deficits in auxiliary-line awareness; targeted enhancement of this capability yields substantial improvements in overall geometric reasoning performance.

📝 Abstract
Geometry problem solving (GPS) requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. This presents a significant challenge for Multimodal Large Language Models (MLLMs). However, existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation, making them insufficient for assessing MLLMs' long-step reasoning abilities. To bridge these gaps, we present the GeoLaux benchmark, comprising 2,186 geometry problems, incorporating both calculation and proving questions. Notably, the problems require an average of 6.51 reasoning steps, with a maximum of 24 steps, and 41.8% of them need auxiliary line construction. Building on the dataset, we design a novel five-dimensional evaluation strategy assessing answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Extensive experiments on 13 leading MLLMs (including thinking models and non-thinking models) yield three pivotal findings: First, models exhibit substantial performance degradation in extended reasoning steps (nine models demonstrate over 50% performance drop). Second, compared to calculation problems, MLLMs tend to take shortcuts when solving proving problems. Third, models lack auxiliary line awareness, and enhancing this capability proves particularly beneficial for overall geometry reasoning improvement. These findings establish GeoLaux as both a benchmark for evaluating MLLMs' long-step geometric reasoning with auxiliary lines and a guide for capability advancement. Our dataset and code are included in supplementary materials and will be released.
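The abstract's five-dimensional evaluation and the reported long-step degradation can be sketched concretely. The field names and the drop metric below are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-problem record for GeoLaux's five evaluation
# dimensions (field names are illustrative, not from the paper).
@dataclass
class GeoLauxEval:
    answer_correct: bool        # final answer matches ground truth
    process_correct: bool       # every reasoning step is valid
    process_quality: float      # graded quality of the reasoning path, in [0, 1]
    aux_line_impact: float      # contribution of auxiliary lines, in [0, 1]
    error_cause: Optional[str]  # annotated cause when a step fails, else None

def long_step_drop(short_acc: float, long_acc: float) -> float:
    """Relative accuracy drop from short-step to long-step problems."""
    return (short_acc - long_acc) / short_acc

# Example: a model at 80% accuracy on short problems and 36% on long
# ones shows a 55% relative drop, past the >50% threshold the paper
# reports for nine of the thirteen evaluated models.
drop = long_step_drop(0.80, 0.36)
```

This is only a sketch of how such a record and metric might be organized; the paper's actual scoring rubric and aggregation are defined by its human-annotation protocol.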
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' geometry performance on long-step problems requiring auxiliary lines
Assessing MLLMs' reasoning abilities with fine-grained process evaluation
Improving MLLMs' auxiliary line awareness for better geometry problem solving
Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoLaux benchmark for long-step geometry problems
Five-dimensional evaluation strategy for MLLMs
Enhancing auxiliary line awareness in models
Yumeng Fu
School of Computer Science and Technology, Xi’an Jiaotong University, China
Jiayin Zhu
School of Computer Science and Technology, Xi’an Jiaotong University, China
Lingling Zhang
Assistant Professor, Xi'an Jiaotong University
Computer vision · Few-shot learning · Zero-shot learning
Bo Zhao
School of Computer Science and Technology, Xi’an Jiaotong University, China
Shaoxuan Ma
School of Computer Science and Technology, Xi’an Jiaotong University, China
Yushun Zhang
The Chinese University of Hong Kong, Shenzhen, China
Optimization · Deep learning
Yanrui Wu
School of Computer Science and Technology, Xi’an Jiaotong University, China
Wenjun Wu
School of Computer Science and Technology, Xi’an Jiaotong University, China