🤖 AI Summary
Current mass spectrometry–based molecular structure identification methods lack sufficient reliability in high-stakes applications, necessitating robust assessment of prediction confidence. This work proposes a selective prediction framework that abstains from making predictions when uncertainty is excessive, leveraging a risk–coverage trade-off mechanism. It presents the first systematic evaluation of fingerprint-level versus retrieval-level uncertainty quantification strategies and their impact on annotation reliability. By integrating distribution-free risk control, the framework enables users to specify a tolerable error rate and guarantees, with high probability, that this constraint is satisfied. Experimental results demonstrate that retrieval-level aleatoric uncertainty combined with a low-cost first-order confidence metric effectively balances risk and coverage, significantly outperforming fingerprint-level approaches and offering a highly reliable annotation solution for critical applications such as clinical metabolomics.
📝 Abstract
Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates derived from second-order distributions, and distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
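To make the selective-prediction idea concrete, the sketch below shows one simple way to calibrate an abstention threshold with a distribution-free guarantee. It is not the paper's implementation: the confidence scores, the Hoeffding-style bound, and the function names here are illustrative assumptions standing in for whatever scoring function and generalization bound a practitioner actually uses.

```python
import numpy as np

def calibrate_threshold(conf, correct, alpha=0.1, delta=0.05):
    """Pick a confidence threshold so that, with probability >= 1 - delta,
    the error rate on accepted predictions is at most alpha.

    conf:    confidence scores on a held-out calibration set
    correct: 1 if the top-ranked candidate structure was the true one
    Uses a Hoeffding upper bound on the selective error as a simple
    stand-in for distribution-free risk control.
    """
    order = np.argsort(-conf)          # most confident first
    conf_sorted = conf[order]
    err_sorted = 1 - correct[order]
    best = None
    for n in range(1, len(conf) + 1):  # accept the n most confident points
        err_hat = err_sorted[:n].mean()
        bound = err_hat + np.sqrt(np.log(1 / delta) / (2 * n))
        if bound <= alpha:             # risk certified at this coverage
            best = conf_sorted[n - 1]  # threshold = lowest accepted score
    return best  # None if no threshold certifies the target risk

def selective_predict(conf, threshold):
    """Abstain (False) whenever confidence falls below the threshold."""
    return conf >= threshold
```

Lowering `alpha` tightens the error constraint and shrinks coverage (more abstentions); the risk-coverage tradeoff in the abstract is exactly this trade.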