AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

📅 2025-10-24
🤖 AI Summary
Problem: Detecting novelty in high-dimensional, noisy scientific data remains challenging, and existing anomaly detection methods often lack the statistical rigor needed to declare a discovery. Method: This paper proposes an end-to-end framework that integrates contrastive representation learning with nonparametric two-sample hypothesis testing. Contrastive pre-training, supported by simulated data and domain-guided data augmentation, produces low-dimensional embeddings, which are then passed to the New Physics Learning Machine (NPLM) two-sample test. Contribution/Results: The framework detects subtle anomalous signals with high sensitivity and quantifies them with p-values, supporting statistically grounded claims of discovery. Evaluated on real and synthetic datasets across astronomy, physics, and biology, it is reported to significantly outperform state-of-the-art anomaly detection methods, demonstrating robustness to noise, interpretability through statistically grounded inference, and cross-domain generalizability.

📝 Abstract
Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (the null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
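The abstract does not say which contrastive objective is used. As a rough illustration of what contrastive pre-training optimizes, the sketch below computes an InfoNCE-style loss, a common choice assumed here and not necessarily the one used in AutoSciDACT. The function name `info_nce_loss` and the `temperature` parameter are illustrative, and embeddings are assumed to be nonzero vectors given as plain Python lists.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    positives[i] is treated as the augmented view of anchors[i]; every other
    positives[j] in the batch serves as a negative for anchor i.
    Assumption: all vectors are nonzero and share the same dimension.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is small when each anchor is most similar to its own augmented view and large when it is closer to another sample's view. Minimizing it pulls together the views that the chosen augmentations declare equivalent.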
Problem

Research questions and friction points this paper is trying to address.

Detecting novelty in noisy, high-dimensional scientific datasets with statistical rigor
Producing anomaly detection outputs that support quantifiable claims of scientific discovery
Developing sensitive statistical tests for deviations from a reference distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive pre-training creates low-dimensional data representations
Two-sample test statistically quantifies deviations from reference distribution
Pipeline combines embeddings with hypothesis testing for novelty detection
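To make the hypothesis-testing step concrete, the sketch below runs a simple permutation two-sample test on low-dimensional embeddings and returns a p-value. This is a deliberately simplified stand-in and not the NPLM method: it uses the squared distance between sample means as the test statistic, whereas NPLM learns its test statistic with a machine learning model. All function names, including the `gaussian_sample` toy-data helper, are hypothetical.

```python
import random


def mean_shift_statistic(x, y):
    """Squared Euclidean distance between the sample means of x and y."""
    dim = len(x[0])
    mean_x = [sum(v[d] for v in x) / len(x) for d in range(dim)]
    mean_y = [sum(v[d] for v in y) / len(y) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_x, mean_y))


def permutation_p_value(reference, observed, n_permutations=500, seed=0):
    """P-value for the null hypothesis that both samples share one distribution.

    Labels are shuffled to build the null distribution of the statistic;
    the +1 terms keep the estimate strictly positive.
    """
    rng = random.Random(seed)
    t_observed = mean_shift_statistic(reference, observed)
    pooled = list(reference) + list(observed)
    n_ref = len(reference)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        t_perm = mean_shift_statistic(pooled[:n_ref], pooled[n_ref:])
        if t_perm >= t_observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)


def gaussian_sample(n, mean, dim=2, seed=0):
    """Hypothetical toy data: n points drawn from an isotropic unit Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]
```

For example, `permutation_p_value(gaussian_sample(100, 0.0, seed=1), gaussian_sample(100, 3.0, seed=2))` compares a reference sample against a clearly shifted one. A mean-based statistic is blind to anomalies that leave the mean unchanged, which is one motivation for learned test statistics such as the one NPLM provides.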
Samuel Bright-Thonney
Department of Physics, Massachusetts Institute of Technology
Christina Reissel
Department of Physics, Massachusetts Institute of Technology
Gaia Grosso
Department of Physics, Massachusetts Institute of Technology
Nathaniel Woodward
Department of Physics, Massachusetts Institute of Technology
Katya Govorkova
Department of Physics, Massachusetts Institute of Technology
Andrzej Novak
Department of Physics, Massachusetts Institute of Technology
Sang Eon Park
Department of Physics, Massachusetts Institute of Technology
Eric Moreno
Department of Physics, Massachusetts Institute of Technology
Philip Harris
MIT
Machine Learning, Dark Matter, Higgs boson, Gravitational Waves, FPGAs