TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

📅 2026-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that large-scale vision-language-action (VLA) models, due to high inference latency, operate at low control frequencies and struggle to adapt to dynamic changes in task objectives. To overcome this limitation, the authors propose TIDAL, a novel temporally interleaved dual-frequency control framework. It pairs a low-frequency macro-intent loop that caches semantic embeddings with a high-frequency micro-control loop that interleaves single-step flow integration with action output. By incorporating temporally offset training and a differential motion predictor, TIDAL extends the temporal influence of the cached semantic embeddings and substantially increases control frequency. Evaluated on edge devices, the method achieves approximately 9 Hz closed-loop control (versus a 2.4 Hz baseline), doubles performance on dynamic interception tasks, quadruples feedback frequency, and remains robust under non-paused inference protocols.

📝 Abstract
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, limiting them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy where the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
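The dual-frequency pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: all function names (`encode_intent`, `flow_step`, `read_proprioception`, `motion_delta`) and all dimensions and periods are hypothetical stand-ins, chosen only to show how a slow macro-intent loop caching an embedding interleaves with a fast micro-control loop that takes one flow-integration step per tick using the stale intent plus fresh proprioception.

```python
# Illustrative sketch of TIDAL-style dual-frequency control (assumed
# structure, not the paper's implementation).
import numpy as np

def encode_intent(image, instruction):
    """Stand-in for the slow VLA backbone: returns a semantic embedding."""
    return np.zeros(64)

def read_proprioception():
    """Stand-in for a fresh real-time joint-state reading."""
    return np.zeros(7)

def motion_delta(prev_obs, obs):
    """Stand-in differential motion predictor: finite-difference features
    that expose target/robot velocity, which a static encoder misses."""
    return obs - prev_obs

def flow_step(action, intent, proprio, motion, step, n_steps):
    """One single-step flow integration toward the action distribution
    (placeholder velocity field for illustration)."""
    velocity = -action / n_steps
    return action + velocity

def control_loop(n_ticks=90, macro_period=30, n_flow_steps=10):
    intent = None
    prev_obs = read_proprioception()
    action = np.zeros(7)
    for tick in range(n_ticks):
        obs = read_proprioception()          # high-frequency feedback
        if tick % macro_period == 0:         # low-frequency macro-intent loop
            intent = encode_intent(image=None, instruction="intercept target")
        # High-frequency micro-control loop: exactly one flow-integration
        # step per tick, interleaved with execution, conditioned on the
        # cached (stale) intent plus real-time proprioception and motion.
        motion = motion_delta(prev_obs, obs)
        action = flow_step(action, intent, obs, motion,
                           tick % n_flow_steps, n_flow_steps)
        prev_obs = obs
        yield action                         # execute immediately

actions = list(control_loop())
```

The key point of the sketch is the budget redistribution: the expensive `encode_intent` call runs once per `macro_period` ticks, while every tick still emits an action, so control frequency is decoupled from backbone latency.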
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
inference latency
dynamic environments
execution blind spot
high-frequency control
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal interleaving
dual-frequency control
diffusion-based VLA
predictive compensation
differential motion prediction
Authors

Yuteng Sun — Institute for Infocomm Research (I²R), A*STAR, Singapore
Haoran Wang — Institute for Infocomm Research (I²R), A*STAR, Singapore
Ruofei Bai — Institute for Infocomm Research (I²R), A*STAR, Singapore
Zhengguo Li — IEEE Fellow, Senior Principal Scientist, Institute for Infocomm Research
  Video coding · Physics-guided AI · Computational photography · Sensor fusion · Switched control
Jun Li — Institute for Infocomm Research (I²R), A*STAR, Singapore
Meng Yee (Michael) Chuah — Senior Scientist, Institute for Infocomm Research, Agency for Science, Technology and Research
  Robotics · Reinforcement Learning · Bioinspiration · Force Sensing
Wei Yun Yau — Institute for Infocomm Research (I²R), A*STAR, Singapore