🤖 AI Summary
Metagenomic binning faces fundamental challenges in modeling intrinsic DNA sequence uncertainty: horizontal gene transfer between species and highly similar k-mer distributions hinder discrimination of homologous fragments using deterministic embeddings (e.g., k-mer spectra or LLM-derived vectors). To address this, we propose ProbBin—the first probabilistic binning framework—that maps each DNA fragment to a Gaussian distribution in latent space, explicitly capturing sequence-level uncertainty. We introduce a data-adaptive Wasserstein metric learning mechanism with theoretical guarantees on distribution separability. Furthermore, ProbBin integrates k-mer statistical features with LLM-derived semantic priors into a lightweight, scalable probabilistic representation model. Evaluated on multiple real-world metagenomic datasets, ProbBin consistently outperforms state-of-the-art deterministic methods, achieving an average 12.3% improvement in completeness and an 8.7% reduction in contamination. This work establishes a robust, uncertainty-aware paradigm for large-scale microbiome analysis.
📝 Abstract
Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture the uncertainty inherent in DNA sequences arising from inter-species DNA sharing and from fragments with highly similar representations. We present the first probabilistic embedding approach, UncertainGen, for metagenomic binning, representing each DNA fragment as a probability distribution in latent space. Our approach naturally models sequence-level uncertainty, and we provide theoretical guarantees on embedding distinguishability. This probabilistic embedding framework expands the feasible latent space by introducing a data-adaptive metric, which in turn enables more flexible separation of bins/clusters. Experiments on real metagenomic datasets demonstrate the improvements over deterministic k-mer and LLM-based embeddings for the binning task by offering a scalable and lightweight solution for large-scale metagenomic analysis.