Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching

📅 2026-06-02
🤖 AI Summary
This work addresses the high computational cost of traditional organic crystal structure prediction (CSP) methods, which hinders large-scale screening. The authors propose Clari, a flow matching model that uses pure pair-bias attention to generate redundancy-free unit cells directly from atom types and chemical bonds, without triangle layers or post-generation relaxation. Because it does not require an RDKit-sanitizable input molecule, Clari can handle chemistries such as fullerenes and metal complexes; it also uses self-conditioning, models hydrogens explicitly, and supports energy-based ranking. On OXtal's test sets, Clari achieves a higher solve rate than OXtal while running 15–30× faster per molecule; with energy-based ranking (generating 150 crystals and keeping the top 30 by energy), it still retains a 5–8× speedup, bringing CSP down to seconds per molecule. The study also introduces a new test set, the CSD Teaching Subset, for future benchmarking.
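To make the term "pair-bias attention" concrete, here is a minimal single-head sketch in plain Python. It is not taken from the Clari code: the function name, the scalar-bias layout, and the toy bond-based bias values are assumptions chosen for illustration. The general idea is only that a per-pair term is added to the usual scaled dot-product logits before the softmax.

```python
import math

def pair_bias_attention(q, k, v, bias):
    """Single-head attention with an additive per-pair bias (illustrative sketch).

    q, k, v: lists of n vectors (one per atom).
    bias:    n x n list of scalars added to the attention logits (hypothetical layout).
    Returns (outputs, attention_weights).
    """
    n, d = len(q), len(q[0])
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product logit plus the pair bias for (i, j).
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) + bias[i][j]
            for j in range(n)
        ]
        m = max(logits)                              # subtract max for a stable softmax
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        w = [e / z for e in exps]
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(w[j] * v[j][t] for j in range(n)) for t in range(len(v[0]))])
    return outputs, weights

# Toy 3-atom chain 0-1-2: non-bonded pair (0, 2) gets a negative bias (made-up value).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [[0.0, 0.0, -4.0], [0.0, 0.0, 0.0], [-4.0, 0.0, 0.0]]
out, weights = pair_bias_attention(q, k, v, bias)
```

In this toy setup the negative bias lowers the attention weight between atoms 0 and 2 relative to unbiased attention. A real model would typically learn the bias from pair features rather than hard-code it.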
📝 Abstract
Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU-years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large-scale flow matching model that generates redundancy-free unit cells and replaces triangle layers with pure pair-bias attention. Clari requires only atom types and bonds as input and does not need an RDKit-sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal's test sets, we surpass OXtal's solve rate while obtaining a speedup of $15$-$30\times$. Because Clari also models explicit hydrogens, it supports inference-time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top-30 by energy, we further improve solve rate while maintaining a speedup of $5$-$8\times$. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large-scale virtual screening of organic solids practical. Code is available at https://github.com/aspuru-guzik-group/clari.
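The energy-ranking step described in the abstract (generate 150 crystals, keep the top 30 by energy) can be sketched as a simple lowest-energy selection. This is an illustrative sketch only: `select_lowest_energy`, the dictionary layout, and the random placeholder energies are assumptions, and no real energy model or generated structure is involved.

```python
import heapq
import random

def select_lowest_energy(candidates, energy_fn, keep):
    """Return the `keep` candidates with the lowest energy, lowest first."""
    return heapq.nsmallest(keep, candidates, key=energy_fn)

random.seed(0)
# Placeholder "crystals": each has an id and a made-up energy value.
samples = [{"id": i, "energy": random.gauss(0.0, 1.0)} for i in range(150)]

# Keep the 30 lowest-energy samples out of 150, as in the abstract's setting.
top = select_lowest_energy(samples, lambda s: s["energy"], keep=30)
```

The extra generation and scoring is the trade-off the abstract reports: the speedup over OXtal drops from $15$-$30\times$ to $5$-$8\times$ in exchange for a higher solve rate.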
Problem

Research questions and friction points this paper is trying to address.

organic crystal structure prediction
computational cost
generative models
complex chemistries
molecular input constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
unit cell generation
pair-bias attention
organic crystal structure prediction
large-scale generative model
👥 Authors
Alston Lo
MIT
Machine Learning
Luka Mucko
University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia
Austin H. Cheng
Department of Chemistry, University of Toronto, Toronto, ON, Canada
Andy Cai
Department of Chemistry, University of Toronto, Toronto, ON, Canada
Alastair J. A. Price
Department of Chemistry, University of Toronto, Toronto, ON, Canada
Wojciech Matusik
MIT
Computer Graphics, Digital Fabrication, Computational Design
Alán Aspuru-Guzik
Department of Chemistry, University of Toronto, Toronto, ON, Canada