🤖 AI Summary
This work addresses the limitations of existing automatic train operation systems, which lack efficient and interpretable visual cognition capabilities in complex scenarios and suffer from the absence of domain-specific evaluation benchmarks from the driver's cab perspective. To bridge this gap, we introduce RailVQA-bench, the first railway visual question answering benchmark, and propose RailVQA-CoM, a lightweight, plug-and-play collaborative framework that synergizes large and small models. Our approach features a transparent three-module architecture and adaptive temporal sampling to combine the cognitive strengths of large models with the inference efficiency of compact models. This design improves perceptual generalization, cross-domain adaptability, and model interpretability while reducing latency, enabling integration into autonomous driving systems as an off-the-shelf component.
📄 Abstract
Automatic Train Operation (ATO) relies on low-latency, reliable cab-view visual perception and decision-oriented inference to ensure safe operation in complex and dynamic railway environments. However, existing approaches focus primarily on basic perception and often generalize poorly to rare yet safety-critical corner cases. They also lack the high-level reasoning and planning capabilities required for operational decision-making. Although recent Large Multi-modal Models (LMMs) show strong generalization and cognitive capabilities, their use in safety-critical ATO is hindered by high computational cost and hallucination risk. Meanwhile, reliable domain-specific benchmarks for systematically evaluating cognitive capabilities are still lacking. To address these gaps, we introduce RailVQA-bench, the first VQA benchmark for cab-view visual cognition in ATO, comprising 20,000 single-frame and 1,168 video-based QA pairs to evaluate cognitive generalization and interpretability in both static and dynamic scenarios. Furthermore, we propose RailVQA-CoM, a collaborative large-small model framework that combines small-model efficiency with large-model cognition via a transparent three-module architecture and adaptive temporal sampling, improving perceptual generalization and enabling efficient reasoning and planning. Experiments demonstrate that the proposed approach substantially improves performance, enhances interpretability, reduces inference latency, and strengthens cross-domain generalization, while enabling plug-and-play deployment in autonomous driving systems. Code and datasets will be available at https://github.com/Cybereye-bjtu/RailVQA.