LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation

📅 2026-04-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the stop-and-go behavior of existing vision-language navigation (VLN) systems, which arises because the perception–reasoning–action loop is blocking: the agent must wait for sensing, transmission, and inference before it can move again. The authors propose LiveVLN, a training-free runtime framework that processes newly arrived observations in parallel with execution of the current actions. Using multi-step action continuation and a handover mechanism, the method delivers refreshed future actions before the current executable prefix runs out, so action execution overlaps with perception and reasoning. The framework integrates with compatible pretrained VLM navigators and requires no additional training. On the R2R and RxR benchmarks it preserves navigation performance while reducing waiting time and improving action availability. In real-world deployments, average episode waiting time drops by up to 77.7%, and wall-clock episode time decreases by 12.6% on StreamVLN and 19.6% on NaVIDA.

📝 Abstract
Recent navigation systems achieve strong benchmark results, yet real-world deployment often remains visibly stop-and-go. This bottleneck arises because the sense-inference-execution loop is still blocking: after each new observation, the controller must wait for sensing, transmission, and inference before motion can continue. Reducing action-generation cost alone therefore does not remove redundant waiting. To address this issue, we present LiveVLN, a training-free framework for more continuous embodied navigation by augmenting pretrained VLM navigators with multi-step action continuation. Instead of pausing for each full sense-and-inference round, LiveVLN overlaps execution with the processing of newly arrived observations, allowing refreshed future actions to be handed off before the current executable prefix is exhausted. This design keeps actions continuously available during motion, reducing idle waiting and enabling smoother online execution. The framework operates at runtime and can be integrated with compatible pretrained VLM navigators. Across R2R and RxR, LiveVLN preserves benchmark performance while reducing waiting time and improving action availability. In real-world deployments, it cuts average episode waiting time by up to 77.7% and shortens wall-clock episode time by 12.6% on StreamVLN and 19.6% on NaVIDA, yielding more coherent execution during deployment. Code is available at https://github.com/NIneeeeeem/LiveVLN.
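
To make the difference between a blocking loop and an overlapped one concrete, the following is a minimal timing sketch, not the authors' implementation. It compares idle waiting under the two schedules in a toy model. Every name and number here (`Timing`, `blocking_loop`, `overlapped_loop`, the chunk size, the latencies) is a hypothetical chosen for illustration. The model also simplifies the handoff: refreshed actions are appended once the current buffer drains, whereas the abstract describes handing them off before the executable prefix is exhausted.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Timing:
    total_time: float  # wall-clock seconds for the whole episode
    wait_time: float   # seconds spent idle with no action available


def blocking_loop(num_actions: int, chunk: int, infer_s: float, act_s: float) -> Timing:
    """Stop-and-go baseline: infer, then execute, strictly in sequence."""
    clock = wait = 0.0
    done = 0
    while done < num_actions:
        # The agent stands still for the entire inference round.
        clock += infer_s
        wait += infer_s
        n = min(chunk, num_actions - done)
        clock += n * act_s
        done += n
    return Timing(clock, wait)


def overlapped_loop(num_actions: int, chunk: int, infer_s: float, act_s: float) -> Timing:
    """Toy overlap model: the next inference runs while buffered actions execute."""
    clock = wait = 0.0
    done = 0
    buffer = deque()

    # Cold start: nothing can execute until the first inference returns.
    clock += infer_s
    wait += infer_s
    buffer.extend(range(min(chunk, num_actions)))
    planned = len(buffer)

    while done < num_actions:
        # Launch the next inference now (if more actions are needed);
        # it completes at ready_at, independent of execution.
        ready_at = clock + infer_s if planned < num_actions else None

        # Execute the buffered prefix while inference runs in the background.
        while buffer:
            buffer.popleft()
            clock += act_s
            done += 1

        if ready_at is not None:
            # Idle only if the buffer ran out before inference finished.
            if ready_at > clock:
                wait += ready_at - clock
                clock = ready_at
            n = min(chunk, num_actions - planned)
            buffer.extend(range(planned, planned + n))
            planned += n
    return Timing(clock, wait)


if __name__ == "__main__":
    # Hypothetical numbers: 8 actions, 4 per inference, 0.5 s inference, 0.25 s per action.
    print(blocking_loop(8, 4, 0.5, 0.25))
    print(overlapped_loop(8, 4, 0.5, 0.25))
```

In this toy setting the overlapped schedule waits only during the cold start, because four buffered actions (1.0 s) outlast one inference round (0.5 s). When inference takes longer than the buffered prefix, some waiting remains, which suggests why the number of actions kept available matters. These figures come from the model alone and say nothing about the paper's measured results.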
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Navigation
stop-and-go
embodied navigation
action latency
real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

LiveVLN
vision-language navigation
continuous execution
multi-step action continuation
training-free framework