🤖 AI Summary
To address the low reasoning efficiency and limited self-correction capability of large language models (LLMs) and multimodal LLMs (MLLMs), this paper proposes INoT, a self-introspective reasoning framework. INoT embeds LLM-Read code (pseudocode the model itself interprets) directly into the prompt, enabling the model to perform self-reflection, negation, and correction within a single forward pass, without post-training or additional parameters. This yields an intrinsic, lightweight introspective reasoning mechanism driven purely by programmatic prompt design. Evaluated on six benchmarks covering three task types, INoT improves accuracy by an average of 7.95% over strong baselines while consuming, on average, 58.3% fewer inference tokens than the best-performing baseline; additional experiments verify its applicability to image interpretation and inference. These results demonstrate INoT's effectiveness, generality, and deployment efficiency.
📝 Abstract
AI Agents rely on Large Language Models (LLMs) and Multimodal LLMs (MLLMs) to perform interpretation and inference on text and image tasks without post-training; the LLM or MLLM therefore plays the most critical role and determines an agent's initial abilities and limitations. AI Agents typically employ sophisticated prompt engineering and external reasoning frameworks to obtain promising interactions with LLMs, e.g., Chain-of-Thought, Iteration of Thought, and Image-of-Thought. However, these methods remain constrained by the LLM's inherent limitations in natural language understanding, and their iterative reasoning processes incur substantial inference cost. To this end, we propose a novel AI Agent reasoning framework with Introspection of Thought (INoT), built on a new LLM-Read code embedded in the prompt. It enables the LLM to execute a programmatic dialogue reasoning process by following the code in the prompt, so that self-denial and reflection occur inside the LLM rather than outside it, which effectively reduces token cost. Experiments on six benchmarks spanning three different tasks verify the effectiveness of INoT, which exceeds the baselines with an average performance improvement of 7.95%. Moreover, the token cost of INoT is on average 58.3% lower than that of the best-performing baseline. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.
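To make the core idea concrete, the sketch below shows what an INoT-style prompt might look like: pseudocode for a debate-and-reflect loop is embedded in the prompt so the model carries out self-negation and revision internally, in one forward pass, rather than through external API round-trips. This is a minimal illustration, not the paper's actual template; the `<PromptCode>` tag, the debater roles, and the helper names are all assumptions made for the example.

```python
# Illustrative sketch of an INoT-style prompt (assumed template, not the
# paper's exact LLM-Read code). The pseudocode inside <PromptCode> is never
# executed by an interpreter; the LLM is instructed to "run" it internally,
# so reflection and self-negation happen in a single forward pass.

INOT_TEMPLATE = """\
You are two internal debaters, Debater_A and Debater_B.
Before answering, mentally execute the following pseudocode:

<PromptCode>
answer_A = Debater_A.solve(question)
answer_B = Debater_B.solve(question)
while answer_A != answer_B:
    critique_A = Debater_B.refute(answer_A)   # self-negation
    critique_B = Debater_A.refute(answer_B)
    answer_A = Debater_A.revise(answer_A, critique_A)  # reflection
    answer_B = Debater_B.revise(answer_B, critique_B)
final_answer = consensus(answer_A, answer_B)
</PromptCode>

Question: {question}
Output only final_answer.
"""

def build_inot_prompt(question: str) -> str:
    """Fill the single-pass introspective template with a concrete question."""
    return INOT_TEMPLATE.format(question=question)

if __name__ == "__main__":
    # The resulting string is sent to the LLM as one prompt; no external
    # multi-agent loop or repeated inference calls are needed.
    print(build_inot_prompt("Is 17 a prime number?"))
```

Because the whole debate is described declaratively inside one prompt, token cost scales with a single generation instead of with the number of external iterations, which is the efficiency argument the abstract makes.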