🤖 AI Summary
Multimodal large language models (MLLMs) suffer from low reasoning accuracy and poor verifiability when processing structured visual content such as charts. Method: This paper proposes a verifiable visual reasoning paradigm based on *derendering*, the inverse of rendering, which parses images into executable program code and thereby transforms ambiguous pixel-level perception into deterministic symbolic reasoning. Derendering is introduced as a new modality within an agent-based closed-loop framework that integrates MLLM-driven code generation, execution feedback, critic-based evaluation, and iterative refinement. Results: On benchmarks including CharXiv, ChartQA, and Geometry3K, this approach significantly outperforms methods that rely solely on visual features or use code only for superficial plotting operations. It is the first to realize end-to-end verifiable visual reasoning, spanning perception, symbolic representation, verification, and optimization, in a unified framework.
📝 Abstract
Multimodal Large Language Models (MLLMs) struggle with precise reasoning over structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose leveraging derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we introduce RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image, then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculation and logical inference downstream. On visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or that use code only for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
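The generate–critique–refine loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `mllm_generate`, `critic_score`, and `refine` are hypothetical placeholders standing in for real MLLM calls, program execution, and image comparison.

```python
# Hedged sketch of a RECODE-style derendering loop:
# (1) sample candidate programs that might reproduce the input image,
# (2) have a critic score each rendered reconstruction against the image,
# (3) iteratively refine the best candidate, keeping only improvements.
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str      # candidate plotting program (e.g., matplotlib source)
    score: float   # critic's faithfulness score for its rendering


def mllm_generate(image, n: int) -> list[str]:
    # Placeholder: a real MLLM would emit n candidate programs here.
    return [f"plot_v{i}(image)" for i in range(n)]


def critic_score(image, code: str) -> float:
    # Placeholder: execute `code`, render the result, and compare it
    # to `image`; here we return a fake deterministic score.
    return 1.0 / (1 + len(code))


def refine(code: str) -> str:
    # Placeholder: the MLLM would revise the program using execution
    # feedback (errors, visual mismatches) from the previous round.
    return code + "  # refined"


def recode(image, n_candidates: int = 3, n_rounds: int = 2) -> Candidate:
    # Derender: sample several candidate programs for the image.
    candidates = [Candidate(c, critic_score(image, c))
                  for c in mllm_generate(image, n_candidates)]
    # Critic: keep the most faithful reconstruction.
    best = max(candidates, key=lambda c: c.score)
    # Refine iteratively, accepting a revision only if it scores higher.
    for _ in range(n_rounds):
        revised = refine(best.code)
        s = critic_score(image, revised)
        if s > best.score:
            best = Candidate(revised, s)
    return best
```

The selected program then serves as a symbolic representation of the image, over which exact calculations (e.g., reading off data values from the reconstructed chart spec) can be performed instead of estimating from pixels.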