LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of quantifying the quality of long-context training data. The authors propose the Long-context data selection framework with Attention-based Dependency Measurement (LADM), a self-supervised, attention-driven approach that identifies high-quality long-context samples from large-scale corpora without human annotation. LADM measures token-level long-range dependency strength directly from transformer attention patterns, leveraging the retrieval capabilities of the attention mechanism to capture contextual dependencies, and pairs this scoring with lightweight continual pretraining on only 1B tokens. Experiments show that models trained on LADM-selected data achieve significant gains on long-context benchmarks, including document question answering and long-range reasoning tasks. The work offers an efficient, interpretable, and scalable approach to constructing high-quality long-context datasets for large language models.
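The summary does not spell out the dependency measurement itself. A minimal sketch of the underlying idea, assuming a simple proxy in which a document's score is the average attention mass each token places on distant context (the `min_distance` threshold, the `top_fraction` cutoff, and both function names are illustrative assumptions, not details from the paper):

```python
def long_range_dependency_score(attn, min_distance=64):
    """Score one document from a causal attention matrix.

    `attn` is a list of rows, one per query token; each row is a
    probability distribution over key positions (e.g. one head of a
    transformer layer). The score is the fraction of attention mass
    that queries place on keys at least `min_distance` positions
    behind them, averaged over all query tokens. `min_distance` is an
    illustrative hyperparameter, not taken from the paper.
    """
    total = 0.0
    for q, row in enumerate(attn):
        total += sum(w for k, w in enumerate(row) if q - k >= min_distance)
    return total / len(attn)


def select_long_context_samples(scored_docs, top_fraction=0.1):
    """Keep the highest-scoring fraction of documents.

    `scored_docs` is a list of (doc_id, score) pairs; returns the ids
    of the top `top_fraction` of documents by dependency score.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return [doc_id for doc_id, _ in ranked[:keep]]
```

Under this proxy, a document whose tokens attend mostly to their immediate neighbors scores near zero, while one whose tokens pull information from far earlier in the context scores higher; ranking and thresholding then yields the filtered training set.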

📝 Abstract
Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de facto method to equip LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
Problem

Research questions and friction points this paper is trying to address.

Measure quality of long-context training data for LLMs.
Select high-quality long-context data efficiently.
Improve LLM performance on long-context tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention-based Dependency Measurement for data selection
Efficient identification of high-quality long-context data
Boosts LLM performance with minimal training tokens