🤖 AI Summary
Existing equivariant Transformers suffer from limited expressivity and generalization due to their reliance on low-order features and local attention mechanisms. This work introduces a globally SO(3)-equivariant attention mechanism grounded in irreducible representations of SO(3). It employs sparse Clebsch–Gordan convolutions to efficiently couple features of arbitrary order, reducing input token complexity to O(N log N). Crucially, it systematically integrates Clebsch–Gordan coefficients into the Transformer architecture, and optionally enforces token permutation (S_N) equivariance through either weight sharing or data augmentation. Evaluated on n-body simulation, QM9, ModelNet, and robotic grasping tasks, the method consistently outperforms existing equivariant Transformers, achieving simultaneous improvements in prediction accuracy, GPU memory consumption, and inference speed.
📝 Abstract
The global attention mechanism is one of the keys to the success of the Transformer architecture, but it incurs quadratic computational cost in the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structure of problem instances, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes the Clebsch-Gordan Transformer, which achieves efficient global attention via a novel Clebsch-Gordan convolution on $SO(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving $O(N \log N)$ input token complexity. Additionally, the proposed method scales well to high-order irreducible features by exploiting the sparsity of the Clebsch-Gordan matrix. Lastly, we incorporate optional token permutation equivariance through either weight sharing or data augmentation. We benchmark our method on a diverse set of tasks, including n-body simulation, QM9, ModelNet point cloud classification, and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory usage, speed, and accuracy.
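The sparsity mentioned above comes from the standard Clebsch-Gordan selection rules: a coefficient $\langle l_1 m_1; l_2 m_2 | l_3 m_3 \rangle$ can be nonzero only when the triangle inequality $|l_1 - l_2| \le l_3 \le l_1 + l_2$ holds and $m_3 = m_1 + m_2$. The following minimal sketch (illustrative only, not the paper's implementation) counts the structurally nonzero entries when coupling two order-2 irreps:

```python
# Clebsch-Gordan selection rules, which make the coupling matrix sparse.
# Illustrative sketch only; not the paper's implementation.

def cg_structurally_nonzero(l1, m1, l2, m2, l3, m3):
    """<l1 m1; l2 m2 | l3 m3> can be nonzero only if the triangle
    inequality holds and the magnetic quantum numbers add up."""
    return abs(l1 - l2) <= l3 <= l1 + l2 and m3 == m1 + m2

def sparsity(l1, l2, l_max):
    """Count structurally nonzero entries among all (m1, m2, l3, m3)
    combinations when coupling irreps of orders l1 and l2."""
    total = nonzero = 0
    for l3 in range(l_max + 1):
        for m1 in range(-l1, l1 + 1):
            for m2 in range(-l2, l2 + 1):
                for m3 in range(-l3, l3 + 1):
                    total += 1
                    nonzero += cg_structurally_nonzero(l1, m1, l2, m2, l3, m3)
    return nonzero, total

nz, tot = sparsity(2, 2, 4)
print(nz, tot)  # 85 of 625 entries can be nonzero (~14%)
```

Only the structurally allowed entries need to be stored and multiplied, which is what lets Clebsch-Gordan couplings scale to high-order features.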