🤖 AI Summary
Existing learning-based 6-DoF grasp synthesis methods generalize poorly, limiting plug-and-play use across diverse robotic manipulators and real-world environments. To address this, we propose GraspGen, a framework that models object-centric grasp generation as an iterative diffusion process: a DiffusionTransformer architecture generates candidate grasp poses, and an efficient discriminator, trained with a novel on-generator recipe, scores and filters them. Trained on a newly released simulated dataset of over 53 million grasps, GraspGen outperforms prior methods on singulated objects across multiple gripper types in simulation and achieves state-of-the-art results on the FetchBench benchmark. Crucially, it remains robust to noisy visual observations in real-robot experiments, supporting practical deployment.
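To make the generate-then-filter pipeline concrete, below is a minimal, hypothetical sketch of the inference loop described above: sample grasp poses by reverse diffusion conditioned on an object point cloud, then score the candidates with the discriminator and keep the best. The function and argument names (`sample_and_filter_grasps`, `denoiser`, `discriminator`) and the simplified denoising update are our illustrative assumptions, not the released GraspGen API.

```python
# Hypothetical sketch of a GraspGen-style inference loop (names are assumptions).
import torch

def sample_and_filter_grasps(
    denoiser,        # hypothetical DiffusionTransformer: (poses, t, cloud) -> noise estimate
    discriminator,   # hypothetical scorer: (poses, cloud) -> success logits
    point_cloud,     # (N, 3) object-centric point cloud
    num_grasps=128,
    num_steps=50,
    top_k=16,
):
    # Represent each grasp as a 9-D vector: 3-D translation plus a 6-D
    # continuous rotation representation (two rotation-matrix columns).
    poses = torch.randn(num_grasps, 9)

    # Reverse diffusion: start from noise and iteratively denoise.
    for step in reversed(range(num_steps)):
        t = torch.full((num_grasps,), step)
        eps = denoiser(poses, t, point_cloud)
        poses = poses - eps / num_steps  # simplified update; real samplers follow a noise schedule

    # Score every candidate and keep the top_k most promising grasps.
    scores = discriminator(poses, point_cloud).squeeze(-1)
    best = scores.topk(top_k).indices
    return poses[best], scores[best]
```

The flat 9-D pose vector is one common way to keep SE(3) diffusion in a Euclidean space; the parameterization actually used by GraspGen may differ.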
📝 Abstract
Grasping is a fundamental robot skill, yet despite significant research advances, learning-based 6-DOF grasping approaches are still not turnkey and struggle to generalize across different embodiments and in-the-wild settings. We build upon the recent success in modeling the object-centric grasp generation process as an iterative diffusion process. Our proposed framework, GraspGen, consists of a DiffusionTransformer architecture that enhances grasp generation, paired with an efficient discriminator that scores and filters sampled grasps. We introduce a novel and performant on-generator training recipe for the discriminator. To scale GraspGen across both objects and grippers, we release a new simulated dataset of over 53 million grasps. We demonstrate that GraspGen outperforms prior methods in simulation on singulated objects across different grippers, achieves state-of-the-art performance on the FetchBench grasping benchmark, and performs well on a real robot with noisy visual observations.
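The "on-generator" training recipe is only named in the abstract; the sketch below shows one plausible reading of the idea, with hypothetical names and a deliberately simplified labeling rule. The key point it illustrates is that the discriminator is trained on grasps sampled from the generator itself, so at test time it scores the same distribution, including its failure modes, that it was trained to judge.

```python
# Hypothetical sketch of an on-generator discriminator training step
# (our reading of the idea, not the paper's exact recipe).
import torch
import torch.nn.functional as F

def on_generator_step(generator, discriminator, optimizer, point_cloud,
                      gt_grasps, success_radius=0.02):
    # Draw candidate grasps from the (frozen) generator itself, so the
    # discriminator sees the distribution it must score at test time.
    with torch.no_grad():
        candidates = generator(point_cloud)                   # (B, 9) pose vectors

    # Assumed labeling rule for illustration: a candidate is positive if its
    # translation lies within success_radius of any ground-truth grasp
    # (in practice, labels would come from simulation rollouts or annotations).
    dists = torch.cdist(candidates[:, :3], gt_grasps[:, :3])  # (B, G)
    labels = (dists.min(dim=1).values < success_radius).float()

    logits = discriminator(candidates, point_cloud).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Compared with training only on an offline grasp dataset, this kind of on-generator data keeps the discriminator's inputs matched to the generator's actual samples, which is plausibly why the abstract highlights the recipe as both novel and performant.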