PointAction: 3D Points as Universal Action Representations for Robot Control

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

RGB-only video prediction leaves metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, which makes grounding robot actions ambiguous; separately, scaling action supervision across tasks and embodiments is costly. This work fine-tunes a foundation video diffusion model to jointly predict future RGB frames and dynamic 3D pointmaps, and pairs it with a diffusion-based action decoder that maps the predicted point dynamics to executable robot actions. Metric 3D point dynamics serve as a structured, embodiment-agnostic action interface, reducing action grounding ambiguity and supporting transfer across tasks and embodiments with limited action supervision. The method achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

📝 Abstract

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
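
To make the term "dynamic 3D pointmap" concrete, the sketch below shows one common way such a representation can be built: each pixel of a depth image is unprojected into a metric 3D point, and a sequence of these per-frame pointmaps forms a 4D (time plus 3D) signal. This is an illustration only, not PointAction's implementation. The pinhole camera model, the function name `depth_to_pointmap`, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions introduced here; the abstract does not say how the pointmaps are parameterized or obtained.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Pointmap = List[List[Point]]  # H x W grid of camera-frame (x, y, z) points


def depth_to_pointmap(depth: List[List[float]],
                      fx: float, fy: float,
                      cx: float, cy: float) -> Pointmap:
    """Unproject an H x W depth image (in metres) into an H x W pointmap.

    Hypothetical helper: assumes an ideal pinhole camera with focal
    lengths (fx, fy) and principal point (cx, cy), all in pixels.
    """
    pointmap: Pointmap = []
    for v, row in enumerate(depth):          # v indexes image rows
        pointmap.append([
            ((u - cx) * z / fx, (v - cy) * z / fy, z)  # u indexes columns
            for u, z in enumerate(row)
        ])
    return pointmap


# Two toy 2x2 depth frames: the scene moves from 1 m to 2 m from the camera.
depth_frames = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

# A "dynamic pointmap" here is simply one pointmap per frame: T x H x W x 3.
dynamic_pointmap = [
    depth_to_pointmap(d, fx=1.0, fy=1.0, cx=0.5, cy=0.5) for d in depth_frames
]
```

Unlike RGB values, every entry carries metric coordinates, which is the property the abstract relies on when it describes point dynamics as an action interface. Note that the same pixel in two frames does not necessarily correspond to the same physical point, so this toy example says nothing about how point motion is tracked over time.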

Problem

Research questions and friction points this paper is trying to address.

robot control

action grounding

3D motion

embodiment transfer

video-action models

Innovation

Methods, ideas, or system contributions that make the work stand out.

PointAction

4D point dynamics

video diffusion models