🤖 AI Summary
This work addresses a limitation of current embodied planners: indiscriminately scaling test-time compute raises latency and cost while yielding uneven, often diminishing gains in task success. The authors propose DIRECT, a lightweight routing framework that uses multimodal scene context to allocate compute per prompt instead of relying on a fixed model choice. They analyze three scaling axes (chain-of-thought depth, model size, and memory history) on VLABench and RoboMME and find that compute utility is non-uniform: each axis yields qualitatively different capability gains. In real-world trials with a Franka arm in a DROID setup, covering zero-shot manipulation and long-horizon chaining, the router matches or exceeds a stronger model's success rate at up to 65% lower average latency, delivering comparable embodied planning at a fraction of the cost.
📝 Abstract
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. The project page is at jadee-dao.github.io/direct/.
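To make the idea of per-prompt compute allocation and a success-cost Pareto frontier concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Config` fields, the configuration names, the latencies, the predicted-success values, and the threshold rule in `route` are all hypothetical placeholders chosen for illustration. In a real system, the predicted success would presumably come from a learned router that reads multimodal scene context; here it is a hand-written dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One hypothetical compute setting along the three axes in the abstract."""
    name: str
    cot_tokens: int       # chain-of-thought budget (illustrative unit)
    model_size_b: float   # model size in billions of parameters (illustrative)
    memory_frames: int    # past observations kept in context (illustrative)
    latency_s: float      # assumed average latency, not a measured value


def pareto_frontier(points):
    """Return the (cost, success, name) points that no other point dominates.

    A point is dominated when another point costs no more, succeeds no less,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for cost, success, name in points:
        dominated = any(
            (c2 <= cost and s2 >= success) and (c2 < cost or s2 > success)
            for c2, s2, _ in points
        )
        if not dominated:
            frontier.append((cost, success, name))
    return sorted(frontier)


def route(configs, predicted_success, target=0.8):
    """Pick the lowest-latency config whose predicted success meets the target.

    If no config reaches the target, fall back to the one with the highest
    predicted success. This threshold rule is an assumption of this sketch.
    """
    good_enough = [c for c in configs if predicted_success[c.name] >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c.latency_s)
    return max(configs, key=lambda c: predicted_success[c.name])


# Made-up configurations and scores, for illustration only.
CONFIGS = [
    Config("small-direct", 0, 3.0, 1, 0.5),
    Config("small-cot", 256, 3.0, 1, 1.5),
    Config("large-direct", 0, 30.0, 1, 2.0),
    Config("large-cot-mem", 256, 30.0, 8, 4.0),
]
EASY_SCENE = {"small-direct": 0.90, "small-cot": 0.90,
              "large-direct": 0.92, "large-cot-mem": 0.93}
HARD_SCENE = {"small-direct": 0.30, "small-cot": 0.50,
              "large-direct": 0.60, "large-cot-mem": 0.85}

if __name__ == "__main__":
    print(route(CONFIGS, EASY_SCENE).name)
    print(route(CONFIGS, HARD_SCENE).name)
```

With these invented numbers, the easy scene is routed to the cheapest setting and the hard scene to the most expensive one, which is the behavior the abstract argues for: spend extra compute only where it is expected to change the outcome.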