Augmenting Molecular Language Models with Local $n$-gram Memory

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a locality gap in molecular language models that process SMILES strings: character-level tokenization fragments chemically meaningful local substructures, so the model spends capacity relearning local syntax at the expense of long-range dependencies. The authors propose MolGram, a conditional n-gram memory module that maps recurring local string patterns to learned embeddings through hash lookups, leaving the standard tokenizer unchanged. These pattern embeddings are injected dynamically into the Transformer's hidden states, adding explicit local pattern memory as an efficient inductive bias. Across unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, MolGram consistently improves on baseline models and, per the authors' analyses, outperforms baselines with 3× more parameters.
📝 Abstract
Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
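The mechanism described in the abstract (hash local n-grams to embedding rows, then inject them into hidden states) can be illustrated with a minimal sketch. This is not the authors' implementation: the CRC32 hash, the bucket count, the trigram window, the sigmoid gate, and the names `ngram_bucket_ids` and `inject_memory` are all assumptions chosen for illustration, and the random vectors stand in for learned embeddings and real Transformer states.

```python
import math
import random
import zlib


def ngram_bucket_ids(tokens, n, num_buckets):
    """Hash the n-gram ending at each position to a bucket id.

    Positions with fewer than n tokens of left context get None.
    CRC32 is an assumed hash; the paper only says "hash lookups".
    """
    ids = []
    for i in range(len(tokens)):
        if i + 1 < n:
            ids.append(None)
            continue
        gram = "".join(tokens[i - n + 1 : i + 1])
        ids.append(zlib.crc32(gram.encode("utf-8")) % num_buckets)
    return ids


def inject_memory(hidden, bucket_ids, table):
    """Add each position's n-gram embedding to its hidden state.

    The sigmoid gate conditioned on the hidden state is a hypothetical
    stand-in for the paper's context-dependent injection.
    """
    dim = len(table[0])
    mixed = []
    for h, b in zip(hidden, bucket_ids):
        if b is None:
            mixed.append(list(h))  # no n-gram available: pass through
            continue
        e = table[b]
        score = sum(x * y for x, y in zip(h, e)) / math.sqrt(dim)
        gate = 1.0 / (1.0 + math.exp(-score))
        mixed.append([x + gate * y for x, y in zip(h, e)])
    return mixed


rng = random.Random(0)
DIM, NUM_BUCKETS, N = 8, 1024, 3  # illustrative sizes only

# Embedding table: random here, learned during training in a real model.
table = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

# Character-level tokens of a SMILES string (acetic anhydride).
tokens = list("CC(=O)OC(=O)C")
bucket_ids = ngram_bucket_ids(tokens, N, NUM_BUCKETS)

# Placeholder hidden states, one vector per token.
hidden = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in tokens]
mixed = inject_memory(hidden, bucket_ids, table)
```

In this sketch the two occurrences of the trigram `(=O` map to the same bucket, so both positions receive the same memory embedding without any change to the tokenizer. Because the table has a fixed number of buckets, unrelated n-grams can collide; how MolGram handles collisions is not stated in the abstract.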
Problem

Research questions and friction points this paper is trying to address.

locality gap
molecular language models
SMILES tokenization
long-range dependencies
chemically meaningful motifs
Innovation

Methods, ideas, or system contributions that make the work stand out.

MolGram
n-gram memory
molecular language models
SMILES tokenization
inductive bias