Language-Guided Abstraction for Visual Reasoning

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses limitations of existing methods for the Abstraction and Reasoning Corpus (ARC), which either rely on large language models with billions of parameters or depend solely on visual cues and therefore struggle to capture high-level semantics, often overfitting to pixel-level patterns. To overcome these issues, the authors propose L-VARC, a lightweight framework built on a language-guided Learning Using Privileged Information (LUPI) branch: during training, linguistic semantics guide visual reasoning, and the language branch is discarded at inference to keep the model small. The approach introduces a Semantic Compression Module, which uses a unified, task-agnostic prompt to refine and structure the crowd-sourced LARC language descriptions, and a Cross-Attention Projector, which aligns visual features with semantic embeddings. With only 18 million parameters, L-VARC is reported to outperform current state-of-the-art methods on ARC tasks, and ablation studies confirm the contribution of both components.
📝 Abstract
The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it requires models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either purely language-based or vision-only (i.e., VARC). The former depend heavily on LLMs, requiring billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module that feeds a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw descriptions in LARC (a crowd-sourced language description dataset) are substantially refined and structured to fit within the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, thereby yielding a lightweight model with only 18 million parameters. Extensive experiments demonstrate that L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.
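The training-only role of the LUPI branch can be illustrated with a small sketch. This is not the authors' implementation: it is a pure-Python toy that assumes visual and text tokens share one embedding dimension, omits the learned query/key/value projections a real Cross-Attention Projector would have, and assumes a cosine-based alignment term added to the task loss. The function names and the `weight` parameter are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual_tokens, text_tokens):
    """Each visual token (query) attends over the text tokens (keys and values)."""
    scale = math.sqrt(len(text_tokens[0]))
    attended = []
    for q in visual_tokens:
        weights = softmax([dot(q, k) / scale for k in text_tokens])
        attended.append([
            sum(w * v[d] for w, v in zip(weights, text_tokens))
            for d in range(len(q))
        ])
    return attended


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def alignment_loss(visual_tokens, text_tokens):
    """Assumed loss: mean (1 - cosine) between visual tokens and attended text."""
    attended = cross_attend(visual_tokens, text_tokens)
    gaps = [1.0 - cosine(v, s) for v, s in zip(visual_tokens, attended)]
    return sum(gaps) / len(gaps)


def total_loss(task_loss, visual_tokens, text_tokens=None, weight=0.1):
    """Training passes text_tokens; inference omits them, so the branch vanishes."""
    if text_tokens is None:
        return task_loss
    return task_loss + weight * alignment_loss(visual_tokens, text_tokens)


if __name__ == "__main__":
    random.seed(0)
    visual = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
    print("training loss: ", total_loss(1.0, visual, text))
    print("inference loss:", total_loss(1.0, visual))
```

Because the text tokens enter only through the extra loss term, dropping them at inference leaves the visual path unchanged, which is the property that keeps the deployed model small.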
Problem

Research questions and friction points this paper is trying to address.

Visual Reasoning
Abstraction and Reasoning Corpus
Language-Guided Learning
Semantic Abstraction
Few-Shot Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-Guided Abstraction
Learning Using Privileged Information (LUPI)
Semantic Compression Module
Cross-Attention Projector
Visual Reasoning