🤖 AI Summary
To address the challenges of limited computational resources and inefficient multi-view visual question answering (VQA) when deploying vision-language models (VLMs) on embedded automotive platforms, this paper proposes a lightweight and efficient VLM architecture. Methodologically, it introduces a multi-scale vision encoder that combines cross-scale gating with token-level dynamic routing guided by learned importance scores, together with a sequence-level prioritization mechanism that jointly models prediction loss, epistemic uncertainty, and output diversity to rank training sequences. The approach significantly reduces computational overhead while enhancing semantic understanding. On the DriveLM benchmark it achieves state-of-the-art performance, with relative improvements of 11.1% in BLEU-4 and 35.4% in METEOR alongside a substantially smaller parameter count.
📝 Abstract
Vision-language models (VLMs) employed for visual question answering (VQA) in autonomous driving often require substantial computational resources, which poses a challenge for deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder processes multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose combining normalized loss, uncertainty estimates, and a diversity metric into sequence scores that rank and preserve samples within a sequence priority buffer; samples with higher scores are selected more frequently for training. TinyDrive is first evaluated on our custom-curated VQA dataset and subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
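The sequence-level prioritization described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the min-max normalization, the equal weights `alpha`/`beta`/`gamma`, and the softmax sampling over buffer scores are all assumptions introduced here for illustration.

```python
import math

def _minmax(xs, eps=1e-8):
    # Normalize each criterion to [0, 1] so loss, uncertainty, and
    # diversity are on a comparable scale before combining them.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + eps) for x in xs]

def sequence_priority_scores(losses, uncertainties, diversities,
                             alpha=1.0, beta=1.0, gamma=1.0):
    """Combine normalized loss, uncertainty estimates, and a diversity
    metric into one priority score per sequence (weights are assumed)."""
    return [alpha * l + beta * u + gamma * d
            for l, u, d in zip(_minmax(losses),
                               _minmax(uncertainties),
                               _minmax(diversities))]

def sampling_probabilities(scores, temperature=1.0):
    # Softmax over buffer scores: higher-scoring sequences are drawn
    # more frequently for training (sampling scheme is an assumption).
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = sequence_priority_scores(
    losses=[0.2, 1.5, 0.9],
    uncertainties=[0.1, 0.4, 0.2],
    diversities=[0.3, 0.8, 0.5],
)
probs = sampling_probabilities(scores)
```

Here the second sequence, which has the highest loss, uncertainty, and diversity, receives the largest score and is therefore sampled from the priority buffer most often.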