All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Autonomous driving object detection faces critical challenges in complex multimodal scenarios, including perceptual fragmentation, weak contextual reasoning, and insufficient collaborative intelligence. To address these, we propose a novel perception paradigm driven by vision-language models (VLMs) and large language models (LLMs). Our approach establishes a multi-source data taxonomy spanning onboard vehicles, roadside units, and V2X communication, integrating visual, LiDAR, and radar modalities. It employs a hybrid ViT-Transformer architecture, generative AI, and multimodal foundation models to enable end-to-end mapping from raw sensor inputs to semantic-level detection outputs. This bridges long-standing technical gaps in multimodal perception and contextual inference. We further introduce the first multimodal technology roadmap tailored specifically for autonomous driving detection tasks. Experimental results demonstrate significant improvements in dynamic scene detection accuracy (+12.3% mAP) and situational understanding capability, providing a scalable methodological foundation for next-generation integrated perception frameworks.

Technology Category

Application Category

📝 Abstract
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
Problem

Research questions and friction points this paper is trying to address.

Bridging fragmented knowledge in multimodal perception for autonomous vehicles
Integrating sensor fusion with advanced AI models like VLMs and LLMs
Addressing reliable object detection challenges in complex driving environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrating multimodal sensors with LLM/VLM-driven perception frameworks
Structuring AV datasets for cooperative intelligence and V2X communication
Employing transformer-driven approaches for hybrid sensor fusion