Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large vision-language models (LVLMs) in table image reasoning, where they struggle with complex layouts and the tight entanglement of structural and content information. Existing approaches often rely on costly annotations, reinforcement learning, or external tools, hindering scalability and efficiency. To overcome these challenges, the authors propose an efficient multimodal table reasoning framework that requires neither external tools nor extensive labeled data. The method first decouples structural abstraction from semantic alignment via a disentangled structure-content alignment framework (DiSCo), then builds on it with a global-to-local structure-guided reasoning mechanism (Table-GLS) that performs evidence-grounded inference. Experiments demonstrate that the approach significantly enhances table understanding performance on standard benchmarks and exhibits strong generalization to unseen table structures.

📝 Abstract
Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how can LVLMs be adapted to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly in generalizing to unseen table structures.
Problem

Research questions and friction points this paper is trying to address.

table reasoning
Large Vision-Language Models
multimodal alignment
structure-content decoupling
annotation efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Alignment
Structure-aware Guidance
Table Reasoning
Multimodal Alignment
LVLM Adaptation