De novo molecular structure elucidation from mass spectra via flow matching

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Directly reconstructing complete small-molecule structures from mass spectrometry (MS) data is an ill-posed inverse problem, and conventional approaches suffer from limited accuracy. This work proposes MSFlow, the first two-stage framework to incorporate flow matching generative models into MS interpretation. In the first stage, a molecular formula-constrained Transformer maps the input spectrum into a continuous, chemically informative embedding. In the second stage, a discrete flow matching decoder generates molecular structures from this embedding. By combining information-preserving molecular descriptors with a structure-aware generative mechanism, MSFlow reaches a Top-1 accuracy of up to 45% on small-molecule structure elucidation, an improvement of up to fourteen-fold over the current state of the art.
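
To make the "discrete flow matching decoder" idea concrete, here is a minimal sketch of one common formulation, a masking probability path with a linear schedule. This summary does not say which path, schedule, or molecular representation MSFlow uses, so everything here is an assumption for illustration: the `MASK` token, the token-list representation, and the `predict` callable (a stand-in for a trained denoiser conditioned on the spectrum embedding) are hypothetical names, not the authors' implementation.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def corrupt(tokens, t, rng):
    """Sample x_t from a masking path with a linear schedule.

    Each position independently keeps its clean token with probability t
    and becomes MASK otherwise, so t=0 is fully masked and t=1 is clean.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [tok if rng.random() < t else MASK for tok in tokens]


def generate(length, predict, steps, rng):
    """Generate a token sequence by iterative unmasking (Euler steps).

    `predict` maps the current partially masked sequence to one proposed
    token per position. In a real model it would be a trained network;
    here it is whatever callable the caller passes in.
    """
    tokens = [MASK] * length
    for i in range(steps):
        # Moving from t = i/steps to (i+1)/steps, a masked position is
        # revealed with probability dt / (1 - t) = 1 / (steps - i),
        # which equals 1 on the final step, so no MASK survives.
        p_reveal = 1.0 / (steps - i)
        guesses = predict(tokens)
        tokens = [
            g if tok == MASK and rng.random() < p_reveal else tok
            for tok, g in zip(tokens, guesses)
        ]
    return tokens
```

With an oracle denoiser that always proposes the tokens of ethanol, `generate(3, lambda toks: ["C", "C", "O"], 5, random.Random(0))` returns `["C", "C", "O"]`. Training would pair `corrupt` with a loss that asks the model to recover the clean tokens at the masked positions.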

📝 Abstract
Mass spectrometry is a powerful and widely used tool for identifying molecular structures due to its sensitivity and ability to profile complex samples. However, translating spectra into full molecular structures is a difficult, underdetermined inverse problem. Overcoming this problem is crucial for enabling biological insight, discovering new metabolites, and advancing chemical research across multiple fields. To this end, we develop MSFlow, a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules. In the first stage, we adopt a formula-restricted transformer model for encoding mass spectra into a continuous and chemically informative embedding space. In the second stage, we train a decoder flow matching model to reconstruct molecules from latent embeddings of mass spectra. We present ablation studies demonstrating the importance of using information-preserving molecular descriptors for encoding mass spectra and motivate the use of our discrete flow-based decoder. Our rigorous evaluation demonstrates that MSFlow can accurately translate up to 45 percent of molecular mass spectra into their corresponding molecular representations, an improvement of up to fourteen-fold over the current state-of-the-art. A trained version of MSFlow is made publicly available on GitHub for non-commercial users.
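
The headline number is the fraction of spectra translated into the correct structure. The sketch below shows one common way to compute such a figure, exact-match top-k accuracy over ranked candidates. The abstract does not spell out the matching criterion, so the function name, the string comparison, and the assumption that identifiers were canonicalized beforehand (for example canonical SMILES or InChIKeys) are illustrative choices, not the authors' evaluation code.

```python
def top_k_accuracy(ranked_candidates, targets, k=1):
    """Fraction of spectra whose reference structure is in the first k candidates.

    ranked_candidates: one list of candidate identifiers per spectrum, best first.
    targets: the reference identifier for each spectrum.
    Identifiers are compared as plain strings, so they are assumed to be
    canonicalized before this function is called.
    """
    if len(ranked_candidates) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        raise ValueError("evaluation set is empty")
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_candidates, targets)
    )
    return hits / len(targets)


# Toy data: four spectra, each with a ranked list of generated structures.
predictions = [["CCO", "COC"], ["CC=O", "CCO"], ["c1ccccc1"], ["CCN"]]
references = ["CCO", "CCO", "c1ccccc1", "CCC"]

print(top_k_accuracy(predictions, references, k=1))  # 0.5
print(top_k_accuracy(predictions, references, k=2))  # 0.75
```

Under this definition, a Top-1 accuracy of 45 percent means that the highest-ranked generated structure matches the reference for 45 of every 100 test spectra.
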
Problem

Research questions and friction points this paper is trying to address.

molecular structure elucidation
mass spectrometry
inverse problem
de novo
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow matching
mass spectrometry
molecular structure elucidation
generative model
transformer
Authors
Ghaith Mqawass
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany; Machine Learning and Computational Sciences, Pfizer Research & Development, Berlin, Germany
Tuan Le
Senior Machine Learning Research Scientist
Geometric Deep Learning, Computational Chemistry, Generative Modeling
Fabian Theis
TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany; TUM School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of Computational Biology, Helmholtz Center Munich, Germany
Djork-Arné Clevert
Pfizer, VP, Machine Learning Research
Drug Discovery, Machine Learning, Deep Learning, Computational Chemistry, Computational Biology