🤖 AI Summary
To address object hallucination in vision-language models (VLMs) for autonomous driving—caused by ungrounded, text-dominant chain-of-thought (CoT) reasoning—this paper proposes OmniDrive-R1, an end-to-end trainable framework built on an interleaved multimodal chain-of-thought (iMCoT) mechanism. Methodologically, iMCoT jointly optimizes perception and reasoning instead of decoupling them. It introduces a reinforcement-driven autonomous visual grounding mechanism, leveraging unsupervised, process-oriented cross-modal consistency rewards—eliminating reliance on dense bounding-box annotations or external tools. Furthermore, it designs a two-stage reinforcement learning pipeline and a custom Clip-GRPO algorithm to coordinate visual attention and textual generation. Evaluated on DriveLMM-o1, OmniDrive-R1 achieves a reasoning score of 80.35% (+28.58) and final answer accuracy of 73.62% (+35.81), significantly outperforming the Qwen2.5VL-7B baseline.
📝 Abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We therefore introduce OmniDrive-R1, an end-to-end VLM framework for autonomous driving that unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements: compared to the Qwen2.5VL-7B baseline, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
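To make the abstract's reward idea concrete, here is a minimal sketch of how an annotation-free, GRPO-style grounding reward could work: score each sampled rollout by the cross-modal consistency (cosine similarity) between the embedding of the attended image region and the embedding of the reasoning text, then normalize rewards within the sampled group. The paper does not publish Clip-GRPO's exact formulation; the function names, embedding stand-ins, and group size below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def grounding_reward(region_emb, text_emb):
    # Process-based grounding reward (assumed form): cross-modal consistency
    # between the attended-region embedding and the embedding of the reasoning
    # step that refers to it. No bounding-box labels are needed.
    return cosine_sim(region_emb, text_emb)

def group_relative_advantages(rewards):
    # GRPO-style advantage: normalize each rollout's reward against the
    # mean/std of its own sampled group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy stand-ins for CLIP-style embeddings (in practice these would come
# from a frozen vision-language encoder).
rng = np.random.default_rng(0)
dim = 8
text_emb = rng.normal(size=dim)            # embedding of one reasoning step
region_embs = [rng.normal(size=dim) for _ in range(4)]  # 4 sampled rollouts

rewards = [grounding_reward(e, text_emb) for e in region_embs]
advantages = group_relative_advantages(rewards)
print(np.round(advantages, 3))  # zero-mean, unit-scale advantages
```

Rollouts whose visual focus best matches their own textual reasoning get positive advantages, which is how such a reward can steer attention without dense localization labels or external tool calls.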