🤖 AI Summary
This work addresses the high computational cost of organic crystal structure prediction (CSP), which traditionally takes several CPU-years per molecule and hinders large-scale screening. The authors propose Clari, a large-scale flow matching model that generates redundancy-free unit cells directly from atom types and bonds, replacing the expensive triangle layers used by OXtal with pure pair-bias attention. Because Clari does not require an RDKit-sanitizable input molecule, it applies to challenging chemistries such as fullerenes, metal complexes, and atom clusters. It also models explicit hydrogens, which allows generated crystals to be ranked by energy without a decoration or relaxation step, and the authors ablate design choices including auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal's test sets, Clari surpasses OXtal's solve rate with a 15–30× speedup; generating 150 crystals and selecting the top 30 by energy further improves the solve rate while retaining a 5–8× speedup, bringing CSP down to seconds per molecule. The study also introduces the CSD Teaching Subset, a new test split of diverse and complex molecules for future benchmarking.
📝 Abstract
Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU-years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large-scale flow matching model that generates redundancy-free unit cells and replaces triangle layers with pure pair-bias attention. Clari requires only atom types and bonds as input and does not need an RDKit-sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal's test sets, we surpass OXtal's solve rate while obtaining a speedup of $15$-$30\times$. Because Clari also models explicit hydrogens, it supports inference-time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top-30 by energy, we further improve solve rate while maintaining a speedup of $5$-$8\times$. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large-scale virtual screening of organic solids practical. Code is available at https://github.com/aspuru-guzik-group/clari.
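The inference-time scaling step described in the abstract (generate 150 crystals, keep the 30 with the lowest energy) can be illustrated with a minimal Python sketch. This is not Clari's code or API: `Candidate`, `rank_by_energy`, `sample_and_rank`, `fake_generate`, and `fake_score` are hypothetical names introduced here, and the generator and energy model are replaced by stand-ins. It assumes only that each generated structure can be given a scalar energy where lower means more stable.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Candidate:
    """A generated crystal's index and its energy (hypothetical container)."""
    sample_id: int
    energy: float  # assumption: lower energy means a more stable crystal


def rank_by_energy(candidates: List[Candidate], keep: int) -> List[Candidate]:
    """Return the `keep` lowest-energy candidates, most stable first."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    return sorted(candidates, key=lambda c: c.energy)[:keep]


def sample_and_rank(
    generate: Callable[[int], Any],
    score: Callable[[Any], float],
    n_samples: int = 150,
    keep: int = 30,
) -> List[Candidate]:
    """Draw `n_samples` structures, score each one, and keep the best `keep`."""
    candidates = [Candidate(i, score(generate(i))) for i in range(n_samples)]
    return rank_by_energy(candidates, keep)


# Stand-ins for the generative model and the energy model (assumptions only).
_rng = random.Random(0)


def fake_generate(sample_id: int) -> int:
    """Placeholder for sampling one crystal structure."""
    return sample_id


def fake_score(structure: Any) -> float:
    """Placeholder for an energy evaluation of one structure."""
    return _rng.uniform(-100.0, 0.0)


selected = sample_and_rank(fake_generate, fake_score)
```

Under this scheme the extra cost comes from generating and scoring five times as many structures as are kept, which is consistent with the speedup dropping from $15$-$30\times$ to $5$-$8\times$ when energy ranking is enabled.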