Prefix-free parsing for merging big BWTs

📅 2025-06-03

🤖 AI Summary

Constructing the Burrows–Wheeler Transform (BWT) for ultra-large datasets—such as multi-species genomes or human chromosome collections—poses severe memory challenges for prefix-free parsing (PFP), whose conventional implementation requires loading the entire input into memory, leading to prohibitively high peak memory usage. To address this, we propose a divide-and-conquer framework that integrates BWT merging into PFP for the first time. Our approach comprises three key components: (i) locality-aware subset partitioning exploiting sequence similarity; (ii) an enhanced PFP index structure supporting efficient subset processing; and (iii) an incremental BWT merging algorithm leveraging suffix arrays and lightweight sampling. This enables independent construction of subset BWTs followed by highly efficient global merging. The method breaks the traditional PFP memory bottleneck while preserving BWT correctness and linear-time asymptotic complexity, reducing peak memory consumption by 70–90% compared to standard PFP.

📝 Abstract

When building Burrows-Wheeler Transforms (BWTs) of truly huge datasets, prefix-free parsing (PFP) can use an unreasonable amount of memory. In this paper we show that if a dataset can be broken down into small datasets that are not very similar to each other (such as collections of many copies of genomes of each of several species, or collections of many copies of each of the human chromosomes), then we can drastically reduce PFP's memory footprint by building the BWTs of the small datasets and then merging them into the BWT of the whole dataset.
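
The build-then-merge idea can be illustrated with a toy sketch. This is not the paper's PFP-based procedure: it assumes a naive rotation-sorting BWT builder and an iterative interleave-refinement merge in the style of Holt and McMillan, and the function names `bwt_of_collection` and `merge_bwts` are hypothetical. It also assumes every string ends in the same `$` terminator, which sorts before `A`, `C`, `G`, `T`.

```python
def bwt_of_collection(strings):
    """Naive multi-string BWT: sort every rotation of each s + '$'.

    Rotations are compared as infinite periodic strings; truncating the
    repeated rotation at twice the longest length is enough to decide that.
    """
    terminated = [s + "$" for s in strings]
    width = 2 * max(len(t) for t in terminated)
    rows = []
    for t in terminated:
        for i in range(len(t)):
            rot = t[i:] + t[:i]
            key = (rot * (width // len(rot) + 1))[:width]
            rows.append((key, rot[-1]))  # rot[-1] is the BWT character
    rows.sort(key=lambda r: r[0])
    return "".join(last for _, last in rows)


def merge_bwts(bwt0, bwt1):
    """Merge two BWTs using only the BWTs themselves (no original text).

    The interleave vector says, for each row of the merged BWT, which
    input BWT that row comes from. It starts as 0...01...1 and is refined
    by a stable bucket pass per round until it stops changing.
    """
    alphabet = sorted(set(bwt0) | set(bwt1))
    sources = (bwt0, bwt1)
    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        buckets = {c: [] for c in alphabet}
        pos = [0, 0]
        for bit in interleave:
            c = sources[bit][pos[bit]]
            pos[bit] += 1
            buckets[c].append(bit)  # stable: keeps current relative order
        refined = [bit for c in alphabet for bit in buckets[c]]
        if refined == interleave:
            break
        interleave = refined
    # Read the merged BWT off the converged interleave.
    pos = [0, 0]
    out = []
    for bit in interleave:
        out.append(sources[bit][pos[bit]])
        pos[bit] += 1
    return "".join(out)


if __name__ == "__main__":
    small_a = bwt_of_collection(["GATTACA"])
    small_b = bwt_of_collection(["TAGACAT"])
    merged = merge_bwts(small_a, small_b)
    print(merged == bwt_of_collection(["GATTACA", "TAGACAT"]))
```

In this sketch, each refinement round sorts rows by one more leading character, so the number of rounds grows with the longest match between the two inputs. That is one way to see why merging is attractive when the small datasets are not very similar to each other.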

Problem

Research questions and friction points this paper is trying to address.

Reducing memory usage in BWT construction

Merging BWTs of dissimilar small datasets

Optimizing prefix-free parsing for large datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduce the memory footprint of prefix-free parsing

Merge BWTs of small dissimilar datasets

Efficient BWT construction for huge datasets

🔎 Similar Papers

String Partition for Building Long Burrows-Wheeler Transforms