Transformers trained on proteins can learn to attend to Euclidean distance

📅 2025-02-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether standard Transformers, without SE(3)-equivariant architectures or explicit structural encodings, can learn protein 3D structure directly from atomic Cartesian coordinates. Method: coordinates are passed to the Transformer as linear embeddings and the model is pretrained with masked coordinate prediction; the authors also give a theoretical explanation of how self-attention can learn to act as a 3D Gaussian filter with learned variance, giving it an inherent awareness of Euclidean distance. Contribution/Results: the theory is validated on simulated 3D points and on masked token prediction for proteins, and pretraining protein Transformer encoders with structure improves downstream performance, outperforming custom structural models. These results support using standard Transformers as hybrid structure-language foundation models without geometric priors or task-specific architectural modifications.
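To make the pretraining setup concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of a plain Transformer encoder that consumes linear embeddings of 3D coordinates and is trained with masked coordinate prediction; the class name, hyperparameters, and masking rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordTransformer(nn.Module):
    """Hypothetical sketch: plain Transformer over linearly embedded coordinates."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)              # linear embedding of (x, y, z)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)               # reconstruct masked coordinates

    def forward(self, coords, mask):
        h = self.embed(coords)                                          # (B, L, d_model)
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        return self.head(self.encoder(h))                               # (B, L, 3)

coords = torch.randn(2, 50, 3)        # toy batch: 2 chains, 50 residues of 3D coordinates
mask = torch.rand(2, 50) < 0.15       # mask ~15% of positions for reconstruction
model = CoordTransformer()
loss = nn.functional.mse_loss(model(coords, mask)[mask], coords[mask])
loss.backward()
```

This sketch omits sequence tokens and positional encodings; the paper's hybrid structure-language setting presumably combines such a coordinate branch with the usual amino-acid token embeddings.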

📝 Abstract
While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)-invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this theory using both simulated 3D points and in the context of masked token prediction for proteins. Finally, we show that pre-training protein Transformer encoders with structure improves performance on a downstream task, yielding better performance than custom structural models. Together, this work provides a basis for using standard Transformers as hybrid structure-language models.
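The 3D-Gaussian attention claim can be checked numerically. The snippet below is an illustrative construction, not necessarily the exact one used in the paper's proof: queries and keys are built linearly from the coordinates, with one extra dimension carrying the key point's squared norm, and the resulting softmax attention weights exactly match a row-normalised Gaussian kernel over pairwise Euclidean distances with variance sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 8, 2.0
x = rng.normal(size=(n, 3))             # toy 3D points standing in for atom coordinates

# Attention route: linear maps of the coordinates, plus one extra dimension carrying
# -|x_j|^2 / (2 sigma^2) in the key, paired with a constant 1 in the query.
q = np.concatenate([x / sigma**2, np.ones((n, 1))], axis=1)                        # (n, 4)
k = np.concatenate([x, -np.sum(x**2, axis=1, keepdims=True) / (2 * sigma**2)], axis=1)
scores = q @ k.T                        # q_i . k_j = x_i.x_j / s^2 - |x_j|^2 / (2 s^2)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)             # row-wise softmax

# Direct route: row-normalised 3D Gaussian kernel over Euclidean distances.
d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
gauss = np.exp(-d2 / (2 * sigma**2))
gauss /= gauss.sum(axis=1, keepdims=True)

print(np.allclose(attn, gauss))         # True: dot-product attention reproduces the Gaussian filter
```

The terms of -|x_i - x_j|^2 that depend only on i are constant across each softmax row and cancel, which is why plain dot-product attention over (augmented) linear coordinate embeddings can realise a distance-dependent Gaussian filter with a learned variance.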
Problem

Research questions and friction points this paper is trying to address.

Transformer Models
Protein Tertiary Structure
Direct Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Model
3D Coordinate Processing
Protein Structure Understanding
Isaac Ellmen
DPhil Student, University of Oxford
immunoinformatics, machine learning

Constantin Schneider
Unknown affiliation
immunoinformatics, machine learning, Xyme

Matthew I.J. Raybould
Department of Statistics, University of Oxford

Charlotte M. Deane
Department of Statistics, University of Oxford