Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

📅 2026-06-08

🤖 AI Summary

This work investigates whether images can serve as a self-contained medium for reasoning, independent of textual intermediaries. To this end, it introduces an optical reasoning framework that treats images as complete carriers of logical inference, building structured visual reasoning pathways from typographic and graphical representations. The approach supports end-to-end reasoning without intermediate textual steps and unifies how reasoning is expressed across language and multimodal tasks. On mathematical, scientific, and multimodal benchmarks, the method matches or surpasses conventional text-based chain-of-thought reasoning. It is also more token-efficient: reasoning tokens drop by 28.57% on average on language tasks and by 16% on multimodal tasks, giving 1.96× the token efficiency of standard textual reasoning.

📝 Abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we ask a more ambitious question: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning matches or even exceeds traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
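To make the reported reductions concrete, the following minimal sketch applies the 28.57% and 16% figures to a hypothetical 700-token text rationale. The baseline length and the helper names `reduced_tokens` and `length_ratio` are illustrative assumptions, not values or code from the paper. The abstract does not define how token efficiency is computed, so the sketch does not attempt to reproduce the 1.96× figure; it only shows what the token reductions alone imply.

```python
def reduced_tokens(baseline_tokens: int, reduction_pct: float) -> float:
    """Tokens remaining after cutting `reduction_pct` percent from a baseline."""
    return baseline_tokens * (1 - reduction_pct / 100)


def length_ratio(reduction_pct: float) -> float:
    """Ratio of baseline length to reduced length implied by the cut alone."""
    return 1 / (1 - reduction_pct / 100)


# Hypothetical baseline: a 700-token text rationale (not a figure from the paper).
baseline = 700

language = reduced_tokens(baseline, 28.57)    # reported language-task reduction
multimodal = reduced_tokens(baseline, 16.0)   # reported multimodal-task reduction

print(f"language tasks:   {language:.0f} tokens, {length_ratio(28.57):.2f}x shorter")
print(f"multimodal tasks: {multimodal:.0f} tokens, {length_ratio(16.0):.2f}x shorter")
```

Under these assumptions the token cut by itself corresponds to a rationale roughly 1.4 times shorter on language tasks and about 1.19 times shorter on multimodal tasks. Both fall short of 1.96, which suggests the reported efficiency metric accounts for something beyond raw rationale length, though the abstract does not say what.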

Problem

Research questions and friction points this paper is trying to address.

optical reasoning

image-based reasoning

multimodal reasoning

reasoning medium

visual rationale

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optical Reasoning

Visual Rationale

Token Efficiency