Augmenting Molecular Language Models with Local $n$-gram Memory

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a challenge that molecular language models face when processing SMILES strings: character-level tokenization fragments chemically meaningful local substructures, so the model must repeatedly relearn local syntax at the expense of long-range dependencies. To close this gap, the authors propose MolGram, a conditional $n$-gram memory module that maps recurring local string patterns to learned embeddings via hash lookups, without altering the standard tokenizer. These pattern embeddings are dynamically injected into the Transformer's hidden states via a context-aware mechanism, introducing explicit local pattern memory as an efficient inductive bias. Across unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, MolGram consistently improves on baseline models and, according to the authors' analyses, outperforms baselines with 3$\times$ more parameters.
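To make the fragmentation concrete, here is a small illustrative sketch (not taken from the paper) that splits a SMILES string into characters and counts recurring local patterns. The example molecule (aspirin), the helper names `char_tokens` and `ngram_counts`, and the choice of $n = 4$ are all assumptions made for illustration.

```python
from collections import Counter


def char_tokens(smiles):
    """Character-level tokenization: every character becomes one token."""
    return list(smiles)


def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens in the sequence."""
    windows = ("".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)


# Aspirin, used here only as an illustrative input.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = char_tokens(aspirin)
counts = ngram_counts(tokens, 4)

# The carbonyl pattern "(=O)" is spread over four separate tokens,
# yet it recurs as a unit within the same molecule.
print(len(tokens), counts["(=O)"])
```

A character-level model sees the carbonyl group as four unrelated tokens each time it appears, which is the kind of recurring local pattern an $n$-gram memory can store once.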

📝 Abstract

Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks (unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis) show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
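The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one way a hashed $n$-gram memory with gated injection could look, not the authors' implementation. Everything specific is an assumption: the class name `NGramMemory`, the CRC32 hash, the bucket count, the use of the trailing $n$-gram at each position, and the sigmoid gate computed from the dot product of the hidden state and the retrieved embedding.

```python
import math
import random
import zlib


class NGramMemory:
    """Hypothetical hashed embedding table for local string patterns."""

    def __init__(self, num_buckets, dim, n, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_buckets = num_buckets
        # One embedding row per hash bucket (learned in practice, random here).
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(num_buckets)]

    def bucket(self, ngram):
        # Assumption: any deterministic hash works; CRC32 is used for simplicity.
        return zlib.crc32(ngram.encode("utf-8")) % self.num_buckets

    def lookup(self, tokens, pos):
        # Trailing n-gram ending at `pos`, truncated at the sequence start.
        start = max(0, pos - self.n + 1)
        return self.table[self.bucket("".join(tokens[start:pos + 1]))]


def inject(hidden, memory, tokens):
    """Add the retrieved embedding to each hidden state, scaled by a gate."""
    out = []
    for pos, h in enumerate(hidden):
        e = memory.lookup(tokens, pos)
        # Assumed gate: sigmoid of the dot product between state and embedding,
        # so the amount injected depends on the current context.
        score = sum(hi * ei for hi, ei in zip(h, e))
        gate = 1.0 / (1.0 + math.exp(-score))
        out.append([hi + gate * ei for hi, ei in zip(h, e)])
    return out


tokens = list("c1ccccc1")  # benzene, character-level tokens
memory = NGramMemory(num_buckets=64, dim=4, n=3)
hidden = [[0.0] * 4 for _ in tokens]  # stand-in for Transformer hidden states
updated = inject(hidden, memory, tokens)
```

In this sketch, positions 4, 5, and 6 of the benzene string all end in the trigram `ccc`, so they retrieve the same table row. The tokenizer is untouched; the memory only adds a residual term to the hidden states, and hashing keeps the table size fixed regardless of how many distinct patterns occur.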

Problem

Research questions and friction points this paper is trying to address.

locality gap

molecular language models

SMILES tokenization

long-range dependencies

chemically meaningful motifs

Innovation

Methods, ideas, or system contributions that make the work stand out.

MolGram

n-gram memory

molecular language models