🤖 AI Summary
DNA datasets whose sequences vary widely in length make BWT construction slow and memory-hungry, because conventional algorithms tend to rely on uniform sequence lengths. To address this, the authors propose IBB, an external-memory BWT construction method tailored to variable-length sequences. Its key contributions are: (1) a right-aligned processing strategy that removes the dependence on uniform sequence length; (2) a tree-based data structure that maintains relative insert positions and ranks as sequences are processed; and (3) fine-grained buckets that reduce the amount of I/O to external memory. In the authors' experiments, the method is 10–40% faster than the best existing approaches on most datasets while keeping memory consumption competitive. Faster BWT construction benefits FM-index applications such as read alignment and de novo assembly.
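To make the right-aligned idea concrete, the sketch below walks a set of variable-length sequences column by column from their right ends, so that shorter sequences simply drop out of later iterations instead of needing padding. This only illustrates right alignment in general and is not IBB's implementation; the function name `right_aligned_columns` and the example reads are hypothetical.

```python
from typing import Iterator, List, Tuple


def right_aligned_columns(seqs: List[str]) -> Iterator[List[Tuple[int, str]]]:
    """Yield, for k = 0, 1, ..., the symbols located k positions before
    the end of each sequence.

    Each yielded list holds (sequence index, symbol) pairs. A sequence
    takes part in iteration k only if it has more than k symbols, so no
    padding is needed for short sequences.
    """
    longest = max((len(s) for s in seqs), default=0)
    for k in range(longest):
        yield [(i, s[len(s) - 1 - k]) for i, s in enumerate(seqs) if len(s) > k]


if __name__ == "__main__":
    reads = ["ACGT", "GT", "T"]  # hypothetical reads of different lengths
    for k, column in enumerate(right_aligned_columns(reads)):
        print(k, column)
```

In this toy run all three reads contribute to the first column, two to the second, and only the longest read to the remaining ones.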
📝 Abstract
The Burrows-Wheeler transform (BWT) is integral to the FM-index, which is used extensively in text compression, indexing, pattern search, and bioinformatics problems such as de novo assembly and read alignment. Efficient construction of the BWT, in terms of both time and memory usage, is therefore key to these applications. We present a novel external-memory algorithm called Improved-Bucket Burrows-Wheeler transform (IBB) for constructing the BWT of DNA datasets with highly diverse sequence lengths. IBB uses a right-aligned approach to efficiently handle sequences of varying lengths, a tree-based data structure to manage relative insert positions and ranks, and fine buckets to reduce the amount of I/O to external memory. Our experiments demonstrate that IBB is 10% to 40% faster than the best existing BWT construction algorithms on most datasets while maintaining competitive memory consumption.
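For readers unfamiliar with the object being built, here is a minimal in-memory sketch of the BWT of a collection of variable-length sequences. It is a naive reference that sorts every suffix explicitly, not the IBB algorithm, and it assumes a common convention that the abstract does not state: each sequence ends with an implicit end marker `$` that sorts before every nucleotide, with ties between markers broken by sequence index. The function name `naive_collection_bwt` is hypothetical.

```python
from typing import List


def naive_collection_bwt(seqs: List[str]) -> str:
    """Return the BWT of a collection of sequences by explicit suffix sorting.

    Assumed convention: sequence i ends with an implicit marker $_i, where
    every marker is smaller than any real symbol and $_i < $_j for i < j.
    """
    entries = []
    for i, s in enumerate(seqs):
        for j in range(len(s) + 1):
            # The BWT symbol is the character preceding the suffix; the
            # suffix that spans the whole sequence is preceded by its marker.
            prev = s[j - 1] if j > 0 else "$"
            entries.append((s[j:], i, prev))
    # Python orders a proper prefix before its extensions, which matches an
    # end marker smaller than all symbols; the index breaks remaining ties.
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


if __name__ == "__main__":
    print(naive_collection_bwt(["ACG", "TG"]))  # prints GG$ACT$
```

This approach holds all suffixes in memory and takes time quadratic in the sequence lengths in the worst case, which is the cost that external-memory methods such as the one described here are designed to avoid.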