Contrastive Domain Generalization for Cross-Instrument Molecular Identification in Mass Spectrometry

📅 2026-01-31
🤖 AI Summary
This work addresses the limited generalization of molecular identification from mass spectrometry data, particularly to unseen molecular scaffolds. The limitation stems from the semantic gap between physical spectral peaks and chemical structural semantics. To bridge this gap, the authors propose a cross-modal alignment framework that maps mass spectra directly into the structural embedding space of a pretrained chemical language model. Because spectra are matched against structure embeddings rather than assigned to a fixed set of classes, the approach is not restricted to closed-set recognition and supports open-set molecular identification and cross-instrument generalization. On a scaffold-split benchmark, the method achieves 42.2% Top-1 accuracy in 256-way zero-shot retrieval and 95.4% accuracy in 5-way 5-shot molecular re-identification, indicating that the learned embedding space is chemically consistent and generalizes to unseen scaffolds.
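As an illustration of what aligning spectra with a molecular embedding space can look like, here is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) objective. The summary does not state the exact loss, so the objective, the function name `symmetric_info_nce`, and the `temperature` value are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(spec_emb, mol_emb, temperature=0.07):
    """Hypothetical alignment loss: row i of spec_emb is the spectrum
    embedding paired with row i of mol_emb (the structure embedding)."""
    s = l2_normalize(spec_emb)
    m = l2_normalize(mol_emb)
    logits = s @ m.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(s))                  # matching pairs sit on the diagonal
    loss_s2m = -log_softmax(logits)[idx, idx].mean()    # spectrum -> molecule
    loss_m2s = -log_softmax(logits.T)[idx, idx].mean()  # molecule -> spectrum
    return 0.5 * (loss_s2m + loss_m2s)
```

In a setup like this, the molecular embeddings would come from the pretrained chemical language model and only the spectrum encoder would need to learn the mapping. Whether the chemical model is frozen or fine-tuned is not stated in the summary.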

📝 Abstract
Identifying molecules from mass spectrometry (MS) data remains a fundamental challenge due to the semantic gap between physical spectral peaks and underlying chemical structures. Existing deep learning approaches often treat spectral matching as a closed-set recognition task, limiting their ability to generalize to unseen molecular scaffolds. To overcome this limitation, we propose a cross-modal alignment framework that directly maps mass spectra into the chemically meaningful molecular structure embedding space of a pretrained chemical language model. On a strict scaffold-disjoint benchmark, our model achieves a Top-1 accuracy of 42.2% in fixed 256-way zero-shot retrieval and generalizes well under a global retrieval setting. Moreover, the learned embedding space is chemically coherent, reaching 95.4% accuracy in 5-way 5-shot molecular re-identification. These results suggest that explicitly integrating physical spectral resolution with molecular structure embedding is key to solving the generalization bottleneck in molecular identification from MS data.
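The two evaluation settings named in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say how the 256 candidates are sampled or which few-shot classifier is used, so random distractor sampling, nearest-prototype classification, and the function names `top1_nway_accuracy` and `few_shot_accuracy` are assumptions.

```python
import numpy as np

def _unit(x):
    # Row-wise L2 normalization so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def top1_nway_accuracy(spec_emb, mol_emb, n_way=256, seed=0):
    """N-way retrieval: each spectrum i is scored against its true molecule
    (row i of mol_emb) plus n_way - 1 randomly drawn distractor molecules."""
    rng = np.random.default_rng(seed)
    q, c = _unit(spec_emb), _unit(mol_emb)
    n = len(q)
    if n < n_way:
        raise ValueError("need at least n_way candidate molecules")
    hits = 0
    for i in range(n):
        others = np.delete(np.arange(n), i)
        distractors = rng.choice(others, size=n_way - 1, replace=False)
        pool = np.concatenate(([i], distractors))  # true match at position 0
        hits += int(np.argmax(c[pool] @ q[i]) == 0)
    return hits / n

def few_shot_accuracy(spec_emb, mol_ids, n_way=5, k_shot=5, episodes=100, seed=0):
    """N-way K-shot re-identification with nearest-prototype classification
    (an assumed classifier): one held-out query spectrum per molecule."""
    rng = np.random.default_rng(seed)
    emb = _unit(spec_emb)
    ids, counts = np.unique(mol_ids, return_counts=True)
    eligible = ids[counts > k_shot]  # need k_shot supports plus one query
    if len(eligible) < n_way:
        raise ValueError("not enough molecules with more than k_shot spectra")
    correct = total = 0
    for _ in range(episodes):
        classes = rng.choice(eligible, size=n_way, replace=False)
        protos, queries = [], []
        for cls in classes:
            idx = rng.permutation(np.flatnonzero(mol_ids == cls))
            protos.append(emb[idx[:k_shot]].mean(axis=0))  # class prototype
            queries.append(emb[idx[k_shot]])               # held-out query
        scores = np.stack(queries) @ _unit(np.stack(protos)).T
        correct += int((scores.argmax(axis=1) == np.arange(n_way)).sum())
        total += n_way
    return correct / total
```

Under these assumptions, chance level is 1/256 (about 0.4%) for the retrieval task and 20% for the 5-way task, which gives a baseline for reading the reported 42.2% and 95.4%.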
Problem

Research questions and friction points this paper is trying to address.

molecular identification
mass spectrometry
domain generalization
semantic gap
molecular scaffold
Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive domain generalization
cross-modal alignment
mass spectrometry
molecular structure embedding
zero-shot retrieval