Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing approaches struggle to achieve action diversity and video consistency at the same time when aligning visual-dynamics priors with executable robot motions. This work proposes Donk, a model that learns a joint distribution over videos and dexterous hand trajectories, enabling unified multimodal conditional generation (conditioned on language, an initial image, and the initial MANO hand state) within a single denoising framework. Donk also supports text-driven generation without the image condition. The method improves dexterous hand trajectory accuracy and preserves strong video fidelity while enabling autoregressive sampling. It achieves state-of-the-art performance across action, video, and text-only generation evaluations.
📝 Abstract
Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.
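The two conditioning regimes described in the abstract can be illustrated with a toy sampling loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for a learned network, the tensor shapes and the 6-dimensional action vector are arbitrary placeholders (not the real MANO parameterization), and the linear update schedule is chosen only for simplicity. What it shows is the structural idea that one denoiser updates video and action together, and that the image and hand-state conditions are optional inputs to the same loop.

```python
import numpy as np

def toy_denoiser(video, action, t, text_emb, image=None, hand_state=None):
    """Hypothetical stand-in for a learned joint denoiser (assumption, not Donk).

    Returns a 'clean' prediction for both modalities from the same conditions.
    """
    # Text conditions both modalities.
    video_pred = np.full_like(video, text_emb.mean())
    action_pred = np.full_like(action, text_emb.mean())
    # The image condition is optional: when absent, only text shapes the video.
    if image is not None:
        video_pred = video_pred + image[None, ...]  # broadcast over frames
    # The initial hand state is likewise optional.
    if hand_state is not None:
        action_pred = action_pred + hand_state[None, ...]  # broadcast over steps
    return video_pred, action_pred

def joint_sample(text_emb, image=None, hand_state=None,
                 frames=4, hw=(8, 8), action_dim=6, steps=10, seed=0):
    """Denoise video and action jointly, starting from Gaussian noise."""
    rng = np.random.default_rng(seed)
    video = rng.standard_normal((frames, *hw))
    action = rng.standard_normal((frames, action_dim))
    for i in range(steps):
        t = 1.0 - i / steps  # noise level, passed through for illustration only
        v_pred, a_pred = toy_denoiser(video, action, t, text_emb, image, hand_state)
        # Same step size for both modalities; it reaches 1.0 on the final step,
        # so the samples land exactly on the last prediction.
        alpha = 1.0 / (steps - i)
        video = video + alpha * (v_pred - video)
        action = action + alpha * (a_pred - action)
    return video, action

text = np.full(3, 0.5)
# Policy-style regime: language + initial image + initial hand state.
policy_video, policy_action = joint_sample(
    text, image=np.ones((8, 8)), hand_state=np.zeros(6))
# Data-engine-style regime: text only, no image condition.
text_video, text_action = joint_sample(text)
```

Calling the same `joint_sample` with and without `image` mirrors how the abstract describes a single denoising architecture serving both as an action policy and as a text-conditioned generator of paired video-action rollouts.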
Problem

Research questions and friction points this paper is trying to address.

video-action alignment
dexterous manipulation
joint distribution
action generation
video foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

joint video-action denoising
dexterous manipulation
distributional alignment
text-conditioned generation
unified generative model