TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

📅 2026-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that large-scale vision-language-action (VLA) models, due to high inference latency, operate at low control frequencies and struggle to adapt to dynamic changes in task objectives. To overcome this limitation, the authors propose TIDAL, a novel temporally interleaved dual-frequency control framework. It pairs a low-frequency macro-intent loop that caches semantic embeddings with a high-frequency micro-control loop that interleaves single-step flow integration with action output. By incorporating temporally offset training and a differential motion predictor, TIDAL extends the temporal influence of the cached semantic embeddings and substantially increases control frequency. Evaluated on edge devices, the method achieves approximately 9 Hz closed-loop control (versus a 2.4 Hz baseline), doubles performance on dynamic interception tasks, quadruples feedback frequency, and remains robust under non-paused inference protocols.

📝 Abstract
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, limiting them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy where the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
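The dual-frequency pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: all function names (`encode_intent`, `flow_step`, `read_proprioception`, `motion_delta`) and all dimensions and periods are hypothetical stand-ins, chosen only to show how a slow macro-intent loop caching an embedding interleaves with a fast micro-control loop that takes one flow-integration step per tick using the stale intent plus fresh proprioception.

```python
# Illustrative sketch of TIDAL-style dual-frequency control (assumed
# structure, not the paper's implementation).
import numpy as np

def encode_intent(image, instruction):
    """Stand-in for the slow VLA backbone: returns a semantic embedding."""
    return np.zeros(64)

def read_proprioception():
    """Stand-in for a fresh real-time joint-state reading."""
    return np.zeros(7)

def motion_delta(prev_obs, obs):
    """Stand-in differential motion predictor: finite-difference features
    that expose target/robot velocity, which a static encoder misses."""
    return obs - prev_obs

def flow_step(action, intent, proprio, motion, step, n_steps):
    """One single-step flow integration toward the action distribution
    (placeholder velocity field for illustration)."""
    velocity = -action / n_steps
    return action + velocity

def control_loop(n_ticks=90, macro_period=30, n_flow_steps=10):
    intent = None
    prev_obs = read_proprioception()
    action = np.zeros(7)
    for tick in range(n_ticks):
        obs = read_proprioception()          # high-frequency feedback
        if tick % macro_period == 0:         # low-frequency macro-intent loop
            intent = encode_intent(image=None, instruction="intercept target")
        # High-frequency micro-control loop: exactly one flow-integration
        # step per tick, interleaved with execution, conditioned on the
        # cached (stale) intent plus real-time proprioception and motion.
        motion = motion_delta(prev_obs, obs)
        action = flow_step(action, intent, obs, motion,
                           tick % n_flow_steps, n_flow_steps)
        prev_obs = obs
        yield action                         # execute immediately

actions = list(control_loop())
```

The key point of the sketch is the budget redistribution: the expensive `encode_intent` call runs once per `macro_period` ticks, while every tick still emits an action, so control frequency is decoupled from backbone latency.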
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
inference latency
dynamic environments
execution blind spot
high-frequency control
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal interleaving
dual-frequency control
diffusion-based VLA
predictive compensation
differential motion prediction
Authors

Yuteng Sun — Institute for Infocomm Research (I²R), A*STAR, Singapore
Haoran Wang — Institute for Infocomm Research (I²R), A*STAR, Singapore
Ruofei Bai — Institute for Infocomm Research (I²R), A*STAR, Singapore
Zhengguo Li — IEEE Fellow, Senior Principal Scientist, Institute for Infocomm Research
  Video coding · Physics-guided AI · Computational photography · Sensor fusion · Switched control
Jun Li — Institute for Infocomm Research (I²R), A*STAR, Singapore
Meng Yee (Michael) Chuah — Senior Scientist, Institute for Infocomm Research, Agency for Science, Technology and Research
  Robotics · Reinforcement Learning · Bioinspiration · Force Sensing
Wei Yun Yau — Institute for Infocomm Research (I²R), A*STAR, Singapore