ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large vision-language models struggle to model the compositional structure and relational semantics of visual evidence in continuous latent space, which limits their reasoning capabilities. This work proposes ReGuLaR, a relation-grounded latent reasoning framework that explicitly incorporates inter-object relational structure into latent reasoning states. To support this approach, the authors introduce RGROUNDING-351K, a large-scale real-world dataset annotated with both object bounding boxes and explicit relational labels. They further design the ReGFormer module, used only at training time, to focus latent reasoning on question-relevant objects and their relations; at inference the model reasons and answers without it. Experiments across multiple benchmarks show that the method outperforms existing approaches and achieves state-of-the-art performance, supporting the role of relation grounding in vision-language reasoning.

📝 Abstract

Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space. Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence. ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations. Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.

Problem

Research questions and friction points this paper is trying to address.

vision-language models

latent reasoning

relational structure

visual evidence

chain-of-thought

Innovation

Methods, ideas, or system contributions that make the work stand out.

relation-grounded reasoning

latent reasoning

vision-language models