AI Summary
This work addresses a limitation of existing tokenization methods, which neglect data quality and consequently underperform on noisy corpora. The authors propose QA-Token, a quality-aware tokenization model that, for the first time, integrates data reliability into vocabulary construction. Built on a bilevel optimization framework, a reinforcement learning merging strategy with convergence guarantees, and an end-to-end training mechanism based on Gumbel-Softmax, QA-Token learns its parameters adaptively. Empirical results show substantial gains: a 6.7-point F1 improvement on genomics tasks, a 30% increase in Sharpe ratio in financial applications, and state-of-the-art pathogen detection (94.53 MCC) on a 1.7 trillion-base pretraining corpus, all while reducing token count by 15%.
Abstract
Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards, with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: in genomics, a 6.7 percentage-point F1 gain in variant calling over BPE; in finance, a 30% Sharpe-ratio improvement. At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. QA-Token unlocks noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
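The core idea of quality-aware vocabulary construction can be illustrated with a minimal sketch. This is not the paper's actual objective: the function names, the per-symbol quality scores in [0, 1], and the min-quality weighting rule below are all illustrative assumptions. The sketch shows a BPE-style merge-selection step in which each adjacent symbol pair is weighted by data quality, so that merges spanning unreliable symbols (e.g. low base-call confidence in genomic reads) are down-weighted relative to plain frequency counting.

```python
# Hypothetical sketch of quality-aware merge scoring (illustrative only,
# not the paper's exact formulation). Each adjacent pair contributes the
# minimum of its two symbols' quality scores, so a single noisy symbol
# suppresses merges that would cross it.
from collections import Counter

def quality_weighted_pair_scores(seq, qual):
    """seq: list of symbols; qual: per-symbol quality scores in [0, 1]."""
    scores = Counter()
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # Weight the pair by its least reliable symbol.
        scores[pair] += min(qual[i], qual[i + 1])
    return scores

def best_merge(seq, qual):
    """Return the highest-scoring pair to merge, or None if seq is too short."""
    scores = quality_weighted_pair_scores(seq, qual)
    return max(scores, key=scores.get) if scores else None

# The low-quality 'G' at position 2 suppresses the (C, G) and (G, T)
# pairs there, so the clean, repeated (A, C) pair wins the merge.
seq = list("ACGTACGT")
qual = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.9, 0.9]
print(best_merge(seq, qual))  # → ('A', 'C')
```

Under plain frequency counting the repeated pairs would tie; the quality weighting breaks such ties toward merges supported by reliable data, which is the intuition behind feeding reliability into vocabulary construction.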