🤖 AI Summary
This work proposes a paradigm for tackling multimodal tasks involving video and audio without native omnimodal capabilities. Using sandboxed tool invocation and code generation, coding agents with only text and image access reformulate complex audio-video tasks as information retrieval and reasoning problems. The study shows that such non-omnimodal coding agents can match, and in several settings outperform, state-of-the-art native omnimodal models and predefined multimodal agent scaffolds. Key contributions include a failure taxonomy, a skill-injection mechanism covering human-written and self-distilled skills, a training recipe named Code-X with the OmniCoding trajectory dataset, and TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Open-source baselines are provided on Qwen-3.5-9B and Qwen-3.6-27B.
📝 Abstract
As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.
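The retrieval framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` type, the keyword-overlap scoring, and the helper names `retrieve_evidence` and `frame_timestamps` are all hypothetical, chosen only to show how a question about a video might be reduced to selecting a few transcript spans and frame timestamps instead of ingesting the whole stream.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped transcript span (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Small illustrative stopword list; a real agent would likely use something richer.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "what", "does", "and", "at", "on"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def retrieve_evidence(question, segments, top_k=2):
    """Rank segments by keyword overlap with the question; keep the top_k with any overlap."""
    query = tokenize(question)
    scored = [(len(query & tokenize(s.text)), s) for s in segments]
    scored = [(score, s) for score, s in scored if score > 0]
    # Highest overlap first; earlier segments win ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1].start))
    return [s for _, s in scored[:top_k]]


def frame_timestamps(segment, n=3):
    """Pick n evenly spaced timestamps strictly inside a segment for frame sampling."""
    step = (segment.end - segment.start) / (n + 1)
    return [round(segment.start + step * (i + 1), 2) for i in range(n)]


transcript = [
    Segment(0.0, 4.0, "Welcome to the cooking show."),
    Segment(4.0, 9.0, "First we chop the onions finely."),
    Segment(9.0, 15.0, "Then the onions go into the hot pan with butter."),
]

hits = retrieve_evidence("What does the chef add to the pan with the onions?", transcript)
for seg in hits:
    print(seg.start, seg.end, frame_timestamps(seg))
```

In this toy setup only the selected spans and a handful of frames from them would be passed to a text+image model, which is the sense in which the task becomes retrieval and information processing. The scoring here is deliberately naive; it stands in for whatever tools an agent would actually orchestrate.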