🤖 AI Summary
Wastewater pathogen surveillance lacks high-sensitivity, reference-free modeling tools capable of characterizing uncultured, unassembled, and highly diverse microbial communities. Method: We introduce WasteDNA—the first foundational model for wastewater metagenomics—pretrained end-to-end on 1.5 terabases of real human wastewater sequencing data. It employs a DNA/RNA-customized byte-pair encoding (BPE) scheme and a 7-billion-parameter autoregressive Transformer architecture. Contribution/Results: WasteDNA is the first foundation model trained directly on raw, unassembled metagenomic sequences without reliance on reference genomes or assembly. It achieves state-of-the-art performance on novel benchmarks for pathogen detection and genomic embedding; improves rare virus identification sensitivity by +32%; enables zero-shot cross-species generalization; and supports real-time biological threat screening—establishing a scalable, foundational framework for pandemic early warning and emerging health threat identification.
📝 Abstract
We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.