🤖 AI Summary
This work addresses the vulnerability of diffusion-based robotic visuomotor policies to adversarial manipulation at deployment, and the fact that most prior attacks only disrupt behavior rather than control it. The authors propose TAKO (Test-time Adversarial Takeover), in which an attacker learns a small vocabulary of reusable universal adversarial patches and switches among them in the camera stream to steer a frozen diffusion policy along attacker-chosen trajectories in real time. The work also shows that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. The attack is evaluated across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders, and three generative inference families; human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting.
📝 Abstract
Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.
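To make the test-time interface described in the abstract concrete, the following is a minimal sketch of the patch-switching step only. It is not the authors' implementation: the command names, patch size, patch placement, and every function name here are assumptions for illustration, and the "patches" are random placeholder tiles rather than patches learned through differentiable diffusion inference. The frozen policy that would consume the patched frames is not modeled.

```python
import numpy as np

# Hypothetical steering commands; the abstract does not list the actual vocabulary.
COMMANDS = ("left", "right", "forward")


def make_placeholder_patches(size=16, seed=0):
    """Stand-in for learned universal patches: one random RGB tile per command."""
    rng = np.random.default_rng(seed)
    return {
        c: rng.integers(0, 256, (size, size, 3), dtype=np.uint8) for c in COMMANDS
    }


def apply_patch(frame, patch, top=0, left=0):
    """Return a copy of `frame` with `patch` pasted at (top, left)."""
    h, w = patch.shape[:2]
    if top < 0 or left < 0 or top + h > frame.shape[0] or left + w > frame.shape[1]:
        raise ValueError("patch does not fit inside the frame")
    out = frame.copy()  # leave the original camera frame untouched
    out[top:top + h, left:left + w] = patch
    return out


def steer(frames, commands, patches):
    """Pair each frame with the operator's current command and paste its patch."""
    return [apply_patch(f, patches[c]) for f, c in zip(frames, commands)]
```

In this sketch a trajectory is composed simply by the sequence of commands passed to `steer`; the position and size of the patch are fixed per call, which is an assumption rather than something the abstract specifies.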