🤖 AI Summary
This study addresses the lack of a standardized textual representation for molecules in large language models (LLMs) and the often-overlooked effect of representation choice on model performance. The authors systematically evaluate nine molecular representations, including SMILES, InChI, IUPAC, and CML, across eight chemical tasks using sixteen LLMs from five model families, spanning general-purpose, reasoning-enhanced, chemistry-specific, and closed frontier models. Generation quality is assessed with LLM-as-a-judge, and mechanistic analyses (tokenization audits, linear probing, and attention mapping) examine how each representation is encoded inside the model. The results show that representation efficacy depends strongly on task type: IUPAC yields the highest fraction of correct molecule generations and leads on semantic tasks, structured formats (CML and MolJSON) lead on structural tasks, and CML performs best overall. Based on these findings, the authors argue for task-aware representation routing, challenging the prevailing “representation-agnostic” evaluation paradigm.
📝 Abstract
Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. Performance is strongly representation-dependent and no single representation wins across all tasks, though CML performs best overall, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC dominates semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining. Chemistry-specialized models perform well with SMILES at the cost of large degradations with structured text representations, suggesting that SMILES-only evaluation rewards specialization that does not generalize. Using LLM-as-a-judge, we find that IUPAC produces the highest fraction of correct molecule generations. A mechanistic study via tokenization audits, linear probes, and attention analysis shows that representations are encoded differently inside the model; for example, structured representations require higher attention across the molecular span. Our results argue against representation-invariant evaluation and motivate task-aware representation routing for LLM-based chemistry.
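
As a rough illustration of what task-aware representation routing could look like in practice, the Python sketch below picks a molecular representation from a coarse task category before building a prompt. It is not the authors' implementation: the `TaskKind` categories, the `ROUTES` table, the fallback order, and the `build_prompt` helper are all hypothetical names introduced here. The table only mirrors the trends reported in the abstract (structured text for structural tasks, IUPAC for semantic and generation tasks, CML as the overall default).

```python
from enum import Enum


class TaskKind(Enum):
    # Hypothetical coarse task categories; the paper's eight tasks are not listed here.
    STRUCTURAL = "structural"
    SEMANTIC = "semantic"
    GENERATION = "generation"
    OTHER = "other"


# Assumed routing table, mirroring the abstract's reported trends only.
ROUTES = {
    TaskKind.STRUCTURAL: "CML",
    TaskKind.SEMANTIC: "IUPAC",
    TaskKind.GENERATION: "IUPAC",
}

# Fallback order taken from the overall ranking stated in the abstract.
FALLBACK_ORDER = ["CML", "MolJSON", "InChI", "SMILES"]


def route_representation(task: TaskKind) -> str:
    """Return the preferred representation name for a task category."""
    return ROUTES.get(task, FALLBACK_ORDER[0])


def build_prompt(task: TaskKind, question: str, encodings: dict[str, str]) -> str:
    """Build a prompt using the routed representation, falling back if it is missing."""
    candidates = [route_representation(task)] + FALLBACK_ORDER
    for name in candidates:
        if name in encodings:
            return f"Molecule ({name}): {encodings[name]}\n{question}"
    raise ValueError("no supported representation available for this molecule")


# Ethanol in three common encodings.
ethanol = {
    "SMILES": "CCO",
    "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "IUPAC": "ethanol",
}

semantic_prompt = build_prompt(TaskKind.SEMANTIC, "What is this compound used for?", ethanol)
structural_prompt = build_prompt(TaskKind.STRUCTURAL, "How many heavy atoms are present?", ethanol)
```

Here the semantic prompt uses the IUPAC name, while the structural prompt falls back to InChI because no CML or MolJSON encoding was supplied for the molecule. A real router would presumably be tuned per model, since the abstract notes that chemistry-specialized models degrade on structured text representations.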