🤖 AI Summary
This work addresses the problem that SMILES strings generated by large language models are often invalid because they violate syntactic or chemical rules, while existing repair methods struggle to preserve chemical validity and structural fidelity at the same time. In place of conventional post-hoc repair and greedy single-candidate correction, the authors propose AMREC, an identity-preserving molecular recovery approach that explores multiple candidates and selects among them at the trajectory level. It combines RDKit-executable edits, molecule-aware similarity assessment, and proxy-guided multi-trajectory search to restore validity while keeping the molecular identity implied by the description. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile among the compared methods across structural, exact-match, and string-level metrics.
📝 Abstract
Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.
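The contrast between a greedy single-candidate trajectory and expanded candidate exploration with trajectory-level selection can be illustrated with a small sketch. This is not AMREC itself: the validity check is a toy stand-in (balanced parentheses and paired ring-closure digits) for a real parser such as RDKit's `Chem.MolFromSmiles`, the similarity score is a plain string ratio rather than a molecule-aware measure, and the function names, edit set, and parameters (`candidate_edits`, `recover`, `beam_width`, `max_steps`) are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Set


def is_toy_valid(smiles: str) -> bool:
    """Toy validity check (assumption): balanced parentheses, paired ring digits."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0 or not smiles:
        return False
    digits = [ch for ch in smiles if ch.isdigit()]
    return all(digits.count(d) % 2 == 0 for d in set(digits))


def similarity(draft: str, candidate: str) -> float:
    """String-level proxy for how much of the draft a candidate preserves."""
    return SequenceMatcher(None, draft, candidate).ratio()


def candidate_edits(smiles: str) -> List[str]:
    """Hypothetical edit set: delete one character, or append a closing token."""
    deletions = [smiles[:i] + smiles[i + 1:] for i in range(len(smiles))]
    unpaired = sorted({c for c in smiles if c.isdigit() and smiles.count(c) % 2})
    appends = [smiles + ")"] + [smiles + d for d in unpaired]
    return deletions + appends


def recover(draft: str, beam_width: int = 4, max_steps: int = 3) -> Optional[str]:
    """Explore several edit trajectories, then select the valid result closest to the draft."""
    if is_toy_valid(draft):
        return draft
    beam = [draft]
    valid_pool: Set[str] = set()
    for _ in range(max_steps):
        expanded = {c for state in beam for c in candidate_edits(state)}
        valid_pool |= {c for c in expanded if is_toy_valid(c)}
        # Keep the still-invalid candidates that stay closest to the original draft.
        invalid = sorted(
            (c for c in expanded if not is_toy_valid(c)),
            key=lambda c: (-similarity(draft, c), c),
        )
        beam = invalid[:beam_width]
        if not beam:
            break
    if not valid_pool:
        return None
    # Selection happens over everything found along all trajectories.
    return max(sorted(valid_pool), key=lambda c: similarity(draft, c))


if __name__ == "__main__":
    print(recover("CC(=O)Oc1ccccc1C(=O"))  # draft with an unclosed branch
    print(recover("c1ccccc"))              # draft with an unclosed ring
```

Setting `beam_width=1` approximates the greedy single-candidate behaviour, while a wider beam keeps alternative repairs alive so the final choice can favour the candidate that changes the draft least. In a realistic setting the similarity proxy would compare molecular structure rather than raw characters.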