AI Summary
This work addresses the inefficiency and poor interpretability of existing genomic foundation models, which rely on implicit learning of conserved biological motifs. The authors propose Gengram, a novel module that introduces structured motif memory as a modeling paradigm, explicitly retrieving multi-nucleotide motifs through genome-specific hash encoding and a conditional memory mechanism to construct genomic "grammar." Integrated into mainstream genomic foundation model backbones, Gengram enables an efficient, biologically aligned retrieval-augmented architecture. Evaluated across multiple functional genomics tasks, the approach achieves performance gains of up to 14% while producing representations that align closely with established biological knowledge, thereby significantly enhancing both model generalization and mechanistic interpretability.
Abstract
Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genome-specific hashing scheme, establishing genomic "syntax". Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram's latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at https://github.com/zhejianglab/Genos, and the model checkpoint is available at https://huggingface.co/ZhejiangLab/Gengram.
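To make the idea of a hashed motif lookup concrete, the following is a minimal sketch of how such a primitive might look. It is not the authors' implementation: the names `kmer_ids` and `MotifMemory`, the base-4 k-mer encoding, the multiplicative hash constant, the table size, and the sigmoid gate conditioned on the hidden state are all assumptions made for illustration. The abstract only states that multi-base motifs are retrieved through a genome-specific hashing scheme and a conditional memory.

```python
import math
import random

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_ids(seq, k):
    """Return, for each position, the base-4 code of the k-mer ending there.

    Positions without a full window, or whose window contains a
    non-ACGT symbol such as N, get None (no lookup is performed).
    """
    ids = []
    for i in range(len(seq)):
        window = seq[max(0, i - k + 1): i + 1]
        if len(window) < k or any(b not in BASES for b in window):
            ids.append(None)
            continue
        code = 0
        for b in window:
            code = code * 4 + BASES[b]
        ids.append(code)
    return ids


class MotifMemory:
    """Hypothetical hashed k-mer memory with a hidden-state-conditioned gate."""

    def __init__(self, k, num_slots, dim, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.num_slots = num_slots
        # One embedding vector per hash slot (randomly initialised here;
        # in a real model these would be learned).
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(num_slots)]
        # Weights of a scalar gate computed from the hidden state (assumed design).
        self.gate_w = [rng.gauss(0.0, 0.02) for _ in range(dim)]

    def slot(self, code):
        # Simple multiplicative hash into the table (an illustrative choice).
        return (code * 2654435761) % self.num_slots

    def __call__(self, seq, hidden):
        out = []
        for h, code in zip(hidden, kmer_ids(seq, self.k)):
            if code is None:
                out.append(list(h))  # no motif available: pass through unchanged
                continue
            mem = self.table[self.slot(code)]
            score = sum(w * x for w, x in zip(self.gate_w, h))
            gate = 1.0 / (1.0 + math.exp(-score))
            # Add the retrieved motif vector to the hidden state, scaled by the gate.
            out.append([x + gate * m for x, m in zip(h, mem)])
        return out


seq = "ACGTACGT"
memory = MotifMemory(k=3, num_slots=64, dim=4)
hidden = [[0.0] * 4 for _ in seq]
augmented = memory(seq, hidden)
```

In this sketch, the two occurrences of the motif `ACG` (ending at positions 2 and 6) map to the same slot and, given identical hidden states, receive the same update. That is the sense in which the lookup is explicit: the motif is retrieved from a table rather than re-derived by the network from single nucleotides.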