🤖 AI Summary
Multimodal large language models (MLLMs) suffer from low reasoning accuracy and poor verifiability when processing structured visual content such as charts. Method: This paper proposes a verifiable visual reasoning paradigm based on *derendering*, the inverse of rendering, which parses images into executable program code and thereby transforms ambiguous pixel-level perception into deterministic symbolic reasoning. Derendering is introduced as a new modality within an agent-based closed-loop framework that integrates MLLM-driven code generation, execution feedback, critic-based evaluation, and iterative refinement. Results: On benchmarks including CharXiv, ChartQA, and Geometry3K, this approach significantly outperforms methods that rely solely on visual features or use code only for superficial plotting operations. It is the first to realize end-to-end verifiable visual reasoning, spanning perception, symbolic representation, verification, and optimization, in a unified framework.
📝 Abstract
Multimodal Large Language Models (MLLMs) struggle with precise reasoning over structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose leveraging derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we introduce RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image, then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculation and logical inference downstream. On visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or that use code only for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
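The generate–critique–refine loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `mllm_generate`, `critic_score`, and `refine` are hypothetical placeholders standing in for real MLLM calls, program execution, and image comparison.

```python
# Hedged sketch of a RECODE-style derendering loop:
# (1) sample candidate programs that might reproduce the input image,
# (2) have a critic score each rendered reconstruction against the image,
# (3) iteratively refine the best candidate, keeping only improvements.
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str      # candidate plotting program (e.g., matplotlib source)
    score: float   # critic's faithfulness score for its rendering


def mllm_generate(image, n: int) -> list[str]:
    # Placeholder: a real MLLM would emit n candidate programs here.
    return [f"plot_v{i}(image)" for i in range(n)]


def critic_score(image, code: str) -> float:
    # Placeholder: execute `code`, render the result, and compare it
    # to `image`; here we return a fake deterministic score.
    return 1.0 / (1 + len(code))


def refine(code: str) -> str:
    # Placeholder: the MLLM would revise the program using execution
    # feedback (errors, visual mismatches) from the previous round.
    return code + "  # refined"


def recode(image, n_candidates: int = 3, n_rounds: int = 2) -> Candidate:
    # Derender: sample several candidate programs for the image.
    candidates = [Candidate(c, critic_score(image, c))
                  for c in mllm_generate(image, n_candidates)]
    # Critic: keep the most faithful reconstruction.
    best = max(candidates, key=lambda c: c.score)
    # Refine iteratively, accepting a revision only if it scores higher.
    for _ in range(n_rounds):
        revised = refine(best.code)
        s = critic_score(image, revised)
        if s > best.score:
            best = Candidate(revised, s)
    return best
```

The selected program then serves as a symbolic representation of the image, over which exact calculations (e.g., reading off data values from the reconstructed chart spec) can be performed instead of estimating from pixels.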