Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of optimizing the compression ratio in tokenization-free hierarchical models that use byte-level dynamic chunking, and proposes an Adaptive Targeted Dynamic Chunking (ATDC) mechanism. ATDC applies curriculum learning to dynamic chunking control, raising the target compression ratio from low to high over the course of training to keep optimization stable. An analysis relating the target compression ratio to Bytes-Per-Innermost-Chunk (BPIC) makes it possible to track how chunk size evolves during training. Evaluated on the FineWeb-Edu 100B dataset, hierarchical models with ATDC reach Bits-Per-Byte (BPB) performance competitive with both token-level and byte-level baselines, and show more stable training and better final downstream-task performance than models trained with a fixed compression ratio.
📝 Abstract
Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is optimizing the compression ratio, a critical factor in model performance when byte data is processed in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach uses curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), which allows chunk-size evolution to be tracked throughout training. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, compared to models using fixed compression ratios, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks, while maintaining the inherent robustness and flexibility of byte-level processing.
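The abstract does not give the schedule or the BPIC formula, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes the compression ratio is counted as input units per output chunk (so a larger value means higher compression), that the curriculum is a simple linear ramp, and that each hierarchy stage emits a 0/1 boundary mask over the units produced by the previous stage. The function names, the default values 2.0 and 6.0, and the `ramp_frac` parameter are hypothetical and are not taken from the paper.

```python
def target_ratio(step, total_steps, start=2.0, end=6.0, ramp_frac=0.5):
    """Hypothetical linear curriculum for the target compression ratio.

    Ramps from `start` (low compression) to `end` (high compression) over
    the first `ramp_frac` of training, then holds `end`. The schedule shape
    and all default values are assumptions made for illustration.
    """
    ramp_steps = max(1, int(total_steps * ramp_frac))
    progress = min(1.0, step / ramp_steps)
    return start + (end - start) * progress


def bytes_per_innermost_chunk(boundary_masks):
    """Empirical BPIC from per-stage boundary masks (assumed representation).

    boundary_masks[0] has one 0/1 entry per input byte; boundary_masks[k]
    has one entry per chunk produced by stage k-1. A 1 marks the start of
    a new chunk. Returns input bytes divided by innermost chunk count.
    """
    n_bytes = len(boundary_masks[0])
    n_units = n_bytes
    for stage, mask in enumerate(boundary_masks):
        if len(mask) != n_units:
            raise ValueError(f"stage {stage}: expected {n_units} entries")
        n_units = sum(mask)  # chunks emitted by this stage
        if n_units == 0:
            raise ValueError(f"stage {stage}: no chunk boundaries")
    return n_bytes / n_units


if __name__ == "__main__":
    # Target ratio at a few points of a 100-step run.
    for s in (0, 25, 50, 100):
        print(s, target_ratio(s, total_steps=100))
    # Two stages, each merging pairs: 8 bytes -> 4 chunks -> 2 chunks.
    masks = [[1, 0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0]]
    print(bytes_per_innermost_chunk(masks))
```

Under this counting convention the per-stage ratios multiply, since bytes per innermost chunk equals (bytes per stage-1 chunk) times (stage-1 chunks per stage-2 chunk). Logging the measured BPIC next to the scheduled target is one way to follow chunk-size evolution during training. Whether this matches the relationship derived in the paper is not stated in the abstract.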
Problem

Research questions and friction points this paper is trying to address.

tokenization-free
hierarchical model
byte-level compression
dynamic chunking
compression ratio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Targeted Dynamic Chunking
tokenization-free
hierarchical model
compression ratio
curriculum learning