Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

πŸ“… 2026-02-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a limitation of existing tokenization methods, which neglect data quality and consequently underperform on noisy corpora. The authors propose QA-Token, a quality-aware tokenization model that, for the first time, integrates data reliability into vocabulary construction. Combining a bilevel optimization framework, a reinforcement-learning merge strategy with convergence guarantees, and an end-to-end training mechanism based on the Gumbel-Softmax relaxation, QA-Token learns its parameters adaptively. Empirical results demonstrate substantial gains: a 6.7-percentage-point F1 improvement in variant calling, a 30% increase in Sharpe ratio in financial applications, and state-of-the-art pathogen detection performance (94.53 MCC) on a 1.7 trillion-base-pair pretraining corpus, all while reducing token count by 15%.
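As a rough illustration of the quality-aware idea, the sketch below weights a BPE-style pair count by per-token reliability scores, so merges supported mainly by noisy positions rank lower. This is our own minimal sketch, not the authors' code; the function name, data layout, and reliability weights are all hypothetical.

```python
# Hypothetical sketch of quality-aware merge scoring (not the paper's code).
# Standard BPE ranks candidate merges by raw pair frequency; a quality-aware
# variant could instead weight each observed pair by per-position reliability
# (e.g. sequencing base quality), down-ranking merges supported by noise.
from collections import defaultdict

def quality_weighted_pair_scores(sequences, qualities):
    """Score adjacent symbol pairs by summed reliability instead of raw count.

    sequences: list of token lists, e.g. [["A", "C", "G", "T"], ...]
    qualities: per-token reliability weights in [0, 1], aligned with sequences
    """
    scores = defaultdict(float)
    for seq, qual in zip(sequences, qualities):
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i + 1])
            # Weight the pair by the reliability of both members; a
            # low-quality base contributes little to the merge score.
            scores[pair] += qual[i] * qual[i + 1]
    return scores

seqs = [list("ACGTACGT")]
quals = [[0.99, 0.98, 0.40, 0.97, 0.99, 0.95, 0.30, 0.96]]
best = max(quality_weighted_pair_scores(seqs, quals).items(), key=lambda kv: kv[1])
print(best)  # the most reliably supported pair, not merely the most frequent
```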

πŸ“ Abstract
Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: a 6.7-percentage-point F1 gain in variant calling over BPE in genomics, and a 30% Sharpe-ratio improvement in finance. At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
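Contribution (iii) relies on the Gumbel-Softmax relaxation to push gradients through discrete merge choices. The following minimal PyTorch sketch is our illustration under stated assumptions, not the authors' implementation: the candidate count, learning rate, temperature, and the toy utility vector standing in for the paper's quality-aware reward are all made up.

```python
# Minimal sketch of Gumbel-Softmax relaxation for end-to-end tokenizer
# training (ours, assumption-laden; not the paper's code). The relaxation
# lets discrete merge choices receive gradients, so the selection logits
# can be optimized directly against a downstream objective.
import torch
import torch.nn.functional as F

num_candidates = 4                       # hypothetical candidate merges
logits = torch.zeros(num_candidates, requires_grad=True)
merge_utility = torch.tensor([0.2, 0.9, 0.1, 0.5])  # stand-in reward signal

opt = torch.optim.Adam([logits], lr=0.1)
for step in range(200):
    # Differentiable near-one-hot sample over candidate merges; hard=True
    # uses a straight-through estimator, so the forward pass is discrete
    # while gradients flow through the soft sample.
    sample = F.gumbel_softmax(logits, tau=0.5, hard=True)
    loss = -(sample * merge_utility).sum()  # maximize expected utility
    opt.zero_grad()
    loss.backward()
    opt.step()

print(logits.softmax(-1))  # mass should concentrate on the best merge (index 1)
```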
Problem

Research questions and friction points this paper is trying to address.

tokenization
noisy corpora
foundation model
signal quality
pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-Aware Tokenization
Bilevel Optimization
Reinforcement Learning
Gumbel-Softmax Relaxation
Foundation Model Pre-Training
πŸ”Ž Similar Papers
No similar papers found.
Arvid E. Gollwitzer
MIT | ETH Zurich | Broad Institute of MIT and Harvard | CERN
Computational Genomics | Clinical Metagenomics | Cancer Detection | Targeted Drug Delivery
Paridhi Latawa
Massachusetts Institute of Technology, Cambridge, MA, USA
David de Gruijl
Anto Biosciences (YC F25)
Deepak A. Subramanian
Broad Institute of MIT and Harvard, Cambridge, MA, USA; Koch Institute for Integrative Cancer Research, MIT, Cambridge, MA, USA
AdriΓ‘n Noriega de la Colina
Broad Institute of MIT and Harvard, Cambridge, MA, USA; Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Neurology and Neurosurgery, McGill University, Montreal, Canada; The Montreal Neurological Hospital-Institute, Montreal, Canada