🤖 AI Summary
This study examines how protein language models (PLMs) detect exact and approximate repeated segments in protein sequences, and which internal mechanisms are responsible. Starting from model behavior in masked-token prediction, the authors apply interpretability techniques to attention heads, individual neurons, and induction heads. They find that the mechanism for approximate repeats functionally subsumes the one for exact repeats, and that it operates in two stages. First, the model builds feature representations using general positional attention heads together with biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across the repeated segments, which promotes the correct prediction. The model thus combines language-based pattern matching with specialized biological knowledge, which provides a basis for studying more complex evolutionary processes in PLMs.
📝 Abstract
Protein sequences abound in repeated segments, both exact copies and approximate copies that carry mutations. These repeats are important for protein structure and function, which has motivated decades of algorithmic work on repeat identification. Recent work examining masked-token prediction has shown that protein language models (PLMs) identify repeats. To elucidate the internal mechanisms involved, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages. First, PLMs build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
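To make the second stage concrete, the following is a minimal sketch of how one might score a single attention head for "aligned" attention across a repeat. It is an illustration under stated assumptions, not the authors' procedure: the function names `make_repeat_sequence` and `aligned_attention_score`, the single-substitution mutation model, and the plain nested-list attention matrix are all hypothetical choices made here. In practice the attention matrix would come from one head of a real PLM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def make_repeat_sequence(segment, spacer, n_mutations=0, seed=0):
    """Build segment + spacer + copy, where the copy carries n_mutations substitutions.

    Hypothetical helper: returns (sequence, offset), where offset is the
    distance between a position in the first copy and its aligned position
    in the second copy. n_mutations=0 gives an exact repeat.
    """
    rng = random.Random(seed)
    copy = list(segment)
    # Mutate distinct positions, always to a different residue.
    for pos in rng.sample(range(len(segment)), n_mutations):
        choices = [a for a in AMINO_ACIDS if a != copy[pos]]
        copy[pos] = rng.choice(choices)
    offset = len(segment) + len(spacer)
    return segment + spacer + "".join(copy), offset


def aligned_attention_score(attn, seg_len, offset):
    """Mean attention from each token of the second copy to its aligned token.

    attn[q][k] is the weight that query position q places on key position k
    (assumed layout: one head, rows sum to 1). A score near 1 means the head
    looks almost exclusively at the aligned position in the first copy.
    """
    total = 0.0
    for i in range(seg_len):
        query = offset + i  # position in the second copy
        total += attn[query][i]  # weight on the aligned position in the first copy
    return total / seg_len
```

A head that attends uniformly over a sequence of length n would score 1/n under this measure, so scores well above that baseline on both exact and mutated repeats would be consistent with the induction-head behavior the abstract describes. Comparing the score at `n_mutations=0` against increasing mutation counts is one simple way to probe exact versus approximate repeats.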