🤖 AI Summary
To address the challenges of limited computational resources and inefficient multi-view visual question answering (VQA) when deploying vision-language models (VLMs) on embedded automotive platforms, this paper proposes a lightweight and efficient VLM architecture. Methodologically, it introduces a multi-scale vision encoder that combines cross-scale gating with token-level dynamic routing guided by learned importance scores, together with a sequence-level prioritization mechanism that jointly models prediction loss, epistemic uncertainty, and output diversity to rank training sequences. The approach significantly reduces computational overhead while enhancing semantic understanding. On the DriveLM benchmark it achieves state-of-the-art performance, with relative improvements of 11.1% in BLEU-4 and 35.4% in METEOR alongside a substantially smaller parameter count.
📝 Abstract
Vision-language models (VLMs) employed for visual question answering (VQA) in autonomous driving often require substantial computational resources, which poses a challenge for deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder processes multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose combining normalized loss, uncertainty estimates, and a diversity metric into sequence scores that rank and preserve samples within a sequence priority buffer; samples with higher scores are selected more frequently for training. TinyDrive is first evaluated on our custom-curated VQA dataset and subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
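The sequence-level prioritization described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the min-max normalization, the equal weights `alpha`/`beta`/`gamma`, and the softmax sampling over buffer scores are all assumptions introduced here for illustration.

```python
import math

def _minmax(xs, eps=1e-8):
    # Normalize each criterion to [0, 1] so loss, uncertainty, and
    # diversity are on a comparable scale before combining them.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + eps) for x in xs]

def sequence_priority_scores(losses, uncertainties, diversities,
                             alpha=1.0, beta=1.0, gamma=1.0):
    """Combine normalized loss, uncertainty estimates, and a diversity
    metric into one priority score per sequence (weights are assumed)."""
    return [alpha * l + beta * u + gamma * d
            for l, u, d in zip(_minmax(losses),
                               _minmax(uncertainties),
                               _minmax(diversities))]

def sampling_probabilities(scores, temperature=1.0):
    # Softmax over buffer scores: higher-scoring sequences are drawn
    # more frequently for training (sampling scheme is an assumption).
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = sequence_priority_scores(
    losses=[0.2, 1.5, 0.9],
    uncertainties=[0.1, 0.4, 0.2],
    diversities=[0.3, 0.8, 0.5],
)
probs = sampling_probabilities(scores)
```

Here the second sequence, which has the highest loss, uncertainty, and diversity, receives the largest score and is therefore sampled from the priority buffer most often.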