BarcodeBERT: Transformers for Biodiversity Analysis

📅 2023-11-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the limited accuracy and efficiency of existing methods for genus- and species-level classification of DNA barcodes, this study introduces the first family of self-supervised Transformer models designed specifically for DNA barcodes. Leveraging 1.5 million invertebrate COI sequences, the authors incorporate domain-specific biological priors into a tailored masking strategy and tokenization scheme, enabling efficient pretraining. Compared with fine-tuned general-purpose DNA foundation models and conventional machine learning methods, the model matches BLAST-level accuracy on species-level classification while running inference 55× faster. It also significantly outperforms supervised neural networks and existing foundation models on genus- and species-level identification tasks. This work establishes a scalable, high-throughput, high-accuracy paradigm for biodiversity monitoring, bridging critical gaps between deep learning and molecular taxonomy.
📝 Abstract
In the global challenge of understanding and characterizing biodiversity, short species-specific genomic sequences known as DNA barcodes play a critical role, enabling fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on generic supervised training algorithms. We introduce BarcodeBERT, a family of models tailored to biodiversity analysis and trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes. We compared the performance of BarcodeBERT on taxonomic identification tasks against a spectrum of machine learning approaches including supervised training of classical neural architectures and fine-tuning of general DNA foundation models. Our self-supervised pretraining strategies on domain-specific data outperform fine-tuned foundation models, especially in identification tasks involving lower taxa such as genera and species. We also compared BarcodeBERT with BLAST, one of the most widely used bioinformatics tools for sequence searching, and found that our method matched BLAST's performance in species-level classification while being 55 times faster. Our analysis of masking and tokenization strategies also provides practical guidance for building customized DNA language models, emphasizing the importance of aligning model training strategies with dataset characteristics and domain knowledge. The code repository is available at https://github.com/bioscan-ml/BarcodeBERT.
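The abstract highlights tokenization as a key design choice for DNA language models. A minimal sketch of non-overlapping k-mer tokenization, the general scheme used for DNA sequence models, is shown below; the choice of k=4, the vocabulary layout, and the `[UNK]` handling of ambiguity codes are illustrative assumptions, not BarcodeBERT's exact configuration.

```python
from itertools import product


def build_kmer_vocab(k=4, alphabet="ACGT"):
    """Map every possible k-mer over the DNA alphabet to an integer id.
    Id 0 is reserved for [UNK] (k-mers containing N or other ambiguity codes)."""
    vocab = {"[UNK]": 0}
    for i, kmer in enumerate(product(alphabet, repeat=k), start=1):
        vocab["".join(kmer)] = i
    return vocab


def tokenize(sequence, vocab, k=4):
    """Split a barcode into non-overlapping k-mers and look up their ids.
    A trailing partial k-mer, if any, is dropped."""
    sequence = sequence.upper()
    kmers = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]
    return [vocab.get(kmer, vocab["[UNK]"]) for kmer in kmers]


vocab = build_kmer_vocab(k=4)          # 4^4 k-mers + [UNK] = 257 tokens
ids = tokenize("AACGTTACGGTN", vocab)  # last k-mer contains N -> [UNK]
```

Non-overlapping k-mers shorten the input sequence by a factor of k, which reduces the quadratic attention cost; the paper's analysis of masking and tokenization strategies explores exactly these kinds of trade-offs.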
Problem

Research questions and friction points this paper is trying to address.

DNA Barcoding
Machine Learning
Biodiversity Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

BarcodeBERT
Transformer model
DNA barcode
Pablo Millán Arias
University of Waterloo
Niousha Sadjadi
University of Waterloo
Monireh Safari
University of Waterloo
ZeMing Gong
Master student
Computer Vision, Natural Language Processing, Multi-modal Machine Learning
Austin T. Wang
Simon Fraser University
J. B. Haurum
Aalborg University and Pioneer Centre for AI
Iuliia Zarubiieva
Vector Institute, University of Guelph
Dirk Steinke
University of Guelph
Lila Kari
University of Waterloo
A. X. Chang
Alberta Machine Intelligence Institute (Amii), Simon Fraser University
Scott C. Lowe
Postdoctoral Research Fellow, Vector Institute
Machine Learning, Deep Learning, Neuroinformatics, Self-supervision, Reasoning
Graham W. Taylor
Vector Institute, University of Guelph