🤖 AI Summary
Agricultural robots have limited autonomous navigation capability because no vision-language navigation (VLN) benchmark or method is tailored to the unstructured, dynamic environments of farmland. To address this, we introduce A2A, the first agriculture-specific VLN benchmark, and AgriVLN, a dedicated baseline model. AgriVLN features: (1) a Subtask List (STL) module that decomposes natural language instructions into fine-grained, executable subtasks, improving robustness on long instructions; and (2) a framework that prompts a vision-language model (VLM) with carefully crafted templates and RGB video input to generate low-level control actions directly. Evaluated on the A2A benchmark, AgriVLN with STL achieves a navigation success rate (SR) of 0.47, up from 0.33 without STL, and outperforms the existing VLN methods it is compared against. This work establishes a benchmark and a baseline for vision-language-driven autonomous navigation in agriculture.
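To put the reported SR figures in context, the following is a minimal sketch of how a success rate and the relative gain between two rates can be computed. It assumes each episode has already been judged as a success or failure; the success criterion itself is not stated here, and the function names and toy data are hypothetical.

```python
def success_rate(outcomes):
    """Return the fraction of episodes judged successful.

    `outcomes` is an iterable of booleans, one per episode. How an episode
    is judged successful is an assumption left outside this sketch.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def relative_gain(before, after):
    """Return the relative improvement of `after` over `before`."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before


# Toy outcomes, not data from the benchmark.
toy_outcomes = [True, False, True, True]
print(success_rate(toy_outcomes))

# The two SR values reported above: 0.33 without STL, 0.47 with STL.
print(round(relative_gain(0.33, 0.47), 3))
```

With the reported values, the absolute gain is 0.14 and the relative gain is roughly 42%.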
📝 Abstract
Agricultural robots have become capable assistants in agricultural tasks; nevertheless, they still rely heavily on manual operation or on fixed rails that cannot be relocated, resulting in limited mobility and poor adaptability. Vision-and-Language Navigation (VLN) enables robots to navigate to target destinations by following natural language instructions, and has demonstrated strong performance in several domains. However, none of the existing benchmarks or methods is specifically designed for agricultural scenes. To bridge this gap, we propose the Agriculture to Agriculture (A2A) benchmark, containing 1,560 episodes across six diverse agricultural scenes, in which all realistic RGB videos are captured by a front-facing camera on a quadruped robot at a height of 0.38 meters, matching practical deployment conditions. Meanwhile, we propose the Vision-and-Language Navigation for Agricultural Robots (AgriVLN) baseline, built on a Vision-Language Model (VLM) prompted with carefully crafted templates, which can understand both the given instructions and the agricultural environment to generate appropriate low-level actions for robot control. When evaluated on A2A, AgriVLN performs well on short instructions but struggles with long instructions, because it often fails to track which part of the instruction is currently being executed. To address this, we further propose the Subtask List (STL) instruction decomposition module and integrate it into AgriVLN, improving Success Rate (SR) from 0.33 to 0.47. We additionally compare AgriVLN with several existing VLN methods, demonstrating state-of-the-art performance in the agricultural domain.
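The idea of tracking which part of a long instruction is currently being executed can be illustrated with a small sketch. This is not the authors' STL module: the abstract does not specify how instructions are decomposed, so a naive rule-based split stands in for that step, and all class names, state names, and the example instruction are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    PENDING = "pending"
    DOING = "doing"
    DONE = "done"


@dataclass
class Subtask:
    description: str
    state: State = State.PENDING


@dataclass
class SubtaskList:
    subtasks: list[Subtask] = field(default_factory=list)

    @classmethod
    def from_instruction(cls, instruction: str) -> SubtaskList:
        # Assumption: split on "then", periods, and semicolons. This is a
        # placeholder for whatever decomposition the real module performs.
        parts = re.split(r",?\s*\bthen\b|[.;]", instruction)
        tasks = [Subtask(p.strip()) for p in parts if p.strip()]
        if tasks:
            tasks[0].state = State.DOING  # the first subtask starts active
        return cls(tasks)

    def current(self) -> Optional[Subtask]:
        """Return the subtask being executed, or None when all are done."""
        return next((t for t in self.subtasks if t.state is State.DOING), None)

    def advance(self) -> None:
        """Mark the active subtask done and activate the next pending one."""
        active = self.current()
        if active is None:
            return
        active.state = State.DONE
        upcoming = next(
            (t for t in self.subtasks if t.state is State.PENDING), None
        )
        if upcoming is not None:
            upcoming.state = State.DOING

    def render(self) -> str:
        """Format the list as text that could be placed in a VLM prompt."""
        return "\n".join(
            f"{i}. [{t.state.value}] {t.description}"
            for i, t in enumerate(self.subtasks, start=1)
        )


stl = SubtaskList.from_instruction(
    "Walk forward along the ridge, then turn left at the scarecrow. "
    "Stop next to the water tank."
)
stl.advance()
print(stl.render())
```

Keeping an explicit state per subtask means each prompt can show the model what is already finished and what is active, which is the kind of bookkeeping that long instructions appear to need.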