AppleVLM: End-to-End Autonomous Driving With Advanced Perception and Planning-Enhanced Vision-Language Models

📅 2026-02-04
🏛️ IEEE Transactions on Intelligent Transportation Systems (Print)
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations of existing end-to-end vision-language models in autonomous driving: insufficient lane perception, inaccurate language interpretation, and poor robustness in extreme scenarios. To overcome these challenges, the authors propose AppleVLM, a vision-language architecture that jointly enhances perception and planning by integrating multi-view spatiotemporal imagery with an explicit Bird's-Eye-View planning modality. The approach introduces three key innovations: a deformable Transformer-based visual encoder, a tri-modal alignment mechanism bridging vision, language, and planning, and a hierarchical Chain-of-Thought fine-tuning strategy. Experimental results show that AppleVLM achieves state-of-the-art performance on two CARLA benchmarks and enables end-to-end autonomous driving in complex outdoor environments on a real-world AGV platform.
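The deformable Transformer fusion is only named at a high level above. As a rough illustration, the minimal PyTorch sketch below shows how a single-scale deformable-attention step can let each scene query sample a few learned offsets around a reference point in every (view, timestep) feature map and combine the samples with learned weights. All names and shapes here (`DeformableFusion`, `n_sources`, the 6-camera x 2-timestep example) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableFusion(nn.Module):
    """Fuse S = views x timesteps feature maps into query tokens with a
    single-scale deformable-attention step: each query samples n_points
    learned offsets around its reference point in every source map and
    combines the samples with softmax weights. Heads are folded into the
    sampling points for brevity (real multi-head deformable attention
    would also split the channel dimension per head)."""

    def __init__(self, dim=256, n_sources=12, n_heads=8, n_points=4):
        super().__init__()
        self.hp = n_heads * n_points
        self.offset_net = nn.Linear(dim, n_sources * self.hp * 2)
        self.weight_net = nn.Linear(dim, n_sources * self.hp)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feats):
        # queries:    (B, Q, C)       scene queries
        # ref_points: (B, Q, 2)       normalized [0, 1] (x, y) per query
        # feats:      (B, S, C, H, W) multi-view, multi-timestep features
        B, Q, C = queries.shape
        S, H, W = feats.shape[1], feats.shape[3], feats.shape[4]
        offsets = self.offset_net(queries).view(B, Q, S, self.hp, 2)
        weights = self.weight_net(queries).view(B, Q, S * self.hp)
        weights = weights.softmax(-1).view(B, Q, S, self.hp)

        # Pixel offsets -> normalized coords -> grid_sample's [-1, 1] range.
        locs = ref_points[:, :, None, None, :] + offsets / feats.new_tensor([W, H])
        grid = 2.0 * locs - 1.0                              # (B, Q, S, hp, 2)

        vals = self.value_proj(feats.permute(0, 1, 3, 4, 2))  # (B, S, H, W, C)
        vals = vals.permute(0, 1, 4, 2, 3)                   # (B, S, C, H, W)
        out = queries.new_zeros(B, Q, C)
        for s in range(S):                                   # loop kept for clarity
            sampled = F.grid_sample(vals[:, s], grid[:, :, s],
                                    align_corners=False)     # (B, C, Q, hp)
            out = out + (sampled * weights[:, :, s].unsqueeze(1)).sum(-1).transpose(1, 2)
        return self.out_proj(out)                            # (B, Q, C)

# Assumed setup: 6 cameras x 2 timesteps = 12 source feature maps.
fuse = DeformableFusion(dim=256, n_sources=12)
tokens = fuse(torch.randn(2, 100, 256), torch.rand(2, 100, 2),
              torch.randn(2, 12, 256, 28, 48))               # -> (2, 100, 256)
```

Because each query reads only a handful of sampled points per source map, this kind of fusion scales gracefully as cameras or timesteps are added, which is consistent with the robustness-to-camera-variation claim above.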

📝 Abstract
End-to-end autonomous driving has emerged as a promising paradigm that integrates perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulty handling corner cases. To address these issues, we propose AppleVLM, an advanced perception- and planning-enhanced VLM for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. First, the vision encoder fuses spatiotemporal information from multi-view images across multiple timesteps using a deformable Transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Second, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned with a hierarchical Chain-of-Thought strategy integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and demonstrate real-world end-to-end autonomous driving in complex outdoor environments.
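The abstract does not spell out the interface of the planning modality, so the sketch below is one plausible reading under assumed names: a hypothetical `PlanningEncoder` turns a Bird's-Eye-View route raster into its own token stream, and a hypothetical `TriModalDecoder` (a tiny Transformer standing in for the fine-tuned VLM decoder) fuses vision, language, and planning tokens to regress waypoints.

```python
import torch
import torch.nn as nn

class PlanningEncoder(nn.Module):
    """Hypothetical encoder: a BEV raster (e.g. route, drivable area, and
    ego channels) is compressed into a small set of planning tokens."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),            # 4 x 4 = 16 planning tokens
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, bev):                     # bev: (B, in_ch, 128, 128)
        x = self.backbone(bev)                  # (B, dim, 4, 4)
        return self.proj(x.flatten(2).transpose(1, 2))  # (B, 16, dim)

class TriModalDecoder(nn.Module):
    """Stand-in for the fine-tuned VLM decoder: learned waypoint queries
    attend jointly over vision, language, and planning tokens."""
    def __init__(self, dim=256, n_waypoints=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.wp_queries = nn.Parameter(torch.randn(n_waypoints, dim))
        self.head = nn.Linear(dim, 2)           # (x, y) offset per waypoint

    def forward(self, vis_tok, lang_tok, plan_tok):
        q = self.wp_queries.unsqueeze(0).expand(vis_tok.shape[0], -1, -1)
        x = self.blocks(torch.cat([q, vis_tok, lang_tok, plan_tok], dim=1))
        return self.head(x[:, : q.shape[1]])    # (B, n_waypoints, 2)

wp = TriModalDecoder()(
    torch.randn(2, 100, 256),                        # vision tokens
    torch.randn(2, 20, 256),                         # language (prompt) tokens
    PlanningEncoder()(torch.randn(2, 3, 128, 128)),  # planning tokens
)                                                    # wp: (2, 5, 2) waypoints
```

Giving the route its own spatial tokens, rather than expressing it only through the language prompt, is the mechanism the abstract credits with mitigating language bias in navigation instructions.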
Problem

Research questions and friction points this paper is trying to address.

autonomous driving
vision-language models
lane perception
language bias
corner cases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
End-to-end Autonomous Driving
Planning-enhanced Perception
Deformable Transformer
Bird's-Eye-View Planning
Yuxuan Han
Tsinghua University
Computer Vision · Computer Graphics
Kunyuan Wu
Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Guangdong Provincial Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics, School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China
Qianyi Shao
Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Guangdong Provincial Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics, School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China
Renxiang Xiao
Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Guangdong Provincial Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics, School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China
Zilu Wang
Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Guangdong Provincial Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics, School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China
Cansen Jiang
PhD, Le2i, CNRS, Université de Bourgogne
Computer Vision · SLAM · Camera Calibration
Yi Xiao
Shenzhen MSU-BIT University
Composites · Image Processing
Liang Hu
Professor, Harbin Institute of Technology, Shenzhen
State Estimation and SLAM · Navigation and Control · Autonomous Systems
Yunjiang Lou
Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Guangdong Provincial Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics, School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen, China