🤖 AI Summary
This work addresses a limitation of conventional web data filtering, which collapses document quality into a single composite score and so misses content that is valuable along dimensions the scorer underweights. The authors propose a multidimensional filtering framework built on an extended ESSENTIAL-WEB taxonomy that adds two new dimensions, timeliness and cultural specificity. A compute-efficient two-pass selection strategy identifies high-performing filter configurations at a fraction of full scaling-law cost. Using annotations of 14M documents from Qwen2.5-32B, they distill a lightweight 0.5B model and additionally train a 73M-parameter multi-task MLP on E5 embeddings for rapid corpus-wide annotation. Experiments show that filtered mid-tier data improves over its unfiltered baseline by 12.1% on reasoning and 9.5% on coding benchmarks, while filtered data from two tiers below the typical production threshold gains 19.5% on coding and surpasses unfiltered top-tier data on coding benchmarks.
📝 Abstract
Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions, timeliness and cultural specificity, both of which show low pairwise normalized mutual information (NMI) with existing ones. We annotate 14M documents using Qwen2.5-32B and distill the annotations into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers, identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.
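The two-pass search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label values (`evergreen`, `dated`, `global`, `local`), the per-document `utility` field, and the `mean_utility` scorer are hypothetical stand-ins. In the paper, the score of a filter would come from training and evaluating a model on the filtered subset, which a toy proxy cannot reproduce.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]
Filter = Callable[[Doc], bool]
Scorer = Callable[[List[Doc]], float]


def mean_utility(subset: List[Doc]) -> float:
    """Hypothetical proxy score; stands in for a small-scale training run."""
    return sum(d["utility"] for d in subset) / len(subset) if subset else 0.0


def keep(f: Filter, docs: List[Doc]) -> List[Doc]:
    """Return the documents that pass filter f."""
    return [d for d in docs if f(d)]


def single_filters(docs: List[Doc]) -> Dict[str, Filter]:
    """Build one filter per (dimension, label) pair observed in the corpus."""
    pairs = sorted({(dim, lab) for d in docs for dim, lab in d["labels"].items()})
    return {
        f"{dim}={lab}": (lambda d, dim=dim, lab=lab: d["labels"].get(dim) == lab)
        for dim, lab in pairs
    }


def two_pass_select(
    docs: List[Doc], score: Scorer, top_k: int = 2
) -> Tuple[str, Dict[str, float]]:
    # Pass 1: score every single-dimension filter and keep the top_k.
    singles = single_filters(docs)
    ranked = sorted(singles, key=lambda n: score(keep(singles[n], docs)), reverse=True)
    top = ranked[:top_k]

    # Pass 2: build conjunctive (AND) and disjunctive (OR) compounds
    # from the top performers only, then score all candidates.
    candidates: Dict[str, Filter] = {n: singles[n] for n in top}
    for a, b in combinations(top, 2):
        fa, fb = singles[a], singles[b]
        candidates[f"({a} AND {b})"] = lambda d, fa=fa, fb=fb: fa(d) and fb(d)
        candidates[f"({a} OR {b})"] = lambda d, fa=fa, fb=fb: fa(d) or fb(d)

    results = {n: score(keep(f, docs)) for n, f in candidates.items()}
    best = max(results, key=results.get)
    return best, results


# Toy corpus with hypothetical labels and utilities.
TOY_DOCS: List[Doc] = [
    {"labels": {"timeliness": "evergreen", "cultural_specificity": "global"}, "utility": 0.9},
    {"labels": {"timeliness": "evergreen", "cultural_specificity": "local"}, "utility": 0.6},
    {"labels": {"timeliness": "dated", "cultural_specificity": "global"}, "utility": 0.5},
    {"labels": {"timeliness": "dated", "cultural_specificity": "local"}, "utility": 0.2},
]

if __name__ == "__main__":
    best, results = two_pass_select(TOY_DOCS, mean_utility)
    print(best, results)
```

Restricting Pass 2 to the `top_k` single filters is what keeps the search tractable in this sketch: the number of compound candidates grows with the square of `top_k` and not with the square of all (dimension, label) pairs.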