LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of quantifying the quality of long-context training data. The authors propose the Long-context data selection framework with Attention-based Dependency Measurement (LADM), a self-supervised, attention-driven approach that identifies high-quality long-context samples from large-scale corpora without human annotation. LADM measures token-level long-range dependency strength directly from transformer attention patterns, leveraging the retrieval capabilities of the attention mechanism to capture contextual dependencies, and pairs this scoring with lightweight continual pretraining on only 1B tokens. Experiments show that models trained on LADM-selected data achieve significant gains on long-context benchmarks, including document question answering and long-range reasoning tasks. The work offers an efficient, interpretable, and scalable approach to constructing high-quality long-context datasets for large language models.
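The summary does not spell out the dependency measurement itself. A minimal sketch of the underlying idea, assuming a simple proxy in which a document's score is the average attention mass each token places on distant context (the `min_distance` threshold, the `top_fraction` cutoff, and both function names are illustrative assumptions, not details from the paper):

```python
def long_range_dependency_score(attn, min_distance=64):
    """Score one document from a causal attention matrix.

    `attn` is a list of rows, one per query token; each row is a
    probability distribution over key positions (e.g. one head of a
    transformer layer). The score is the fraction of attention mass
    that queries place on keys at least `min_distance` positions
    behind them, averaged over all query tokens. `min_distance` is an
    illustrative hyperparameter, not taken from the paper.
    """
    total = 0.0
    for q, row in enumerate(attn):
        total += sum(w for k, w in enumerate(row) if q - k >= min_distance)
    return total / len(attn)


def select_long_context_samples(scored_docs, top_fraction=0.1):
    """Keep the highest-scoring fraction of documents.

    `scored_docs` is a list of (doc_id, score) pairs; returns the ids
    of the top `top_fraction` of documents by dependency score.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return [doc_id for doc_id, _ in ranked[:keep]]
```

Under this proxy, a document whose tokens attend mostly to their immediate neighbors scores near zero, while one whose tokens pull information from far earlier in the context scores higher; ranking and thresholding then yields the filtered training set.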

📝 Abstract
Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de facto method to equip LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
Problem

Research questions and friction points this paper is trying to address.

Measure quality of long-context training data for LLMs.
Select high-quality long-context data efficiently.
Improve LLM performance on long-context tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based Dependency Measurement for data selection
Efficient identification of high-quality long-context data
Boosts LLM performance with minimal training tokens