Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora

📅 2025-02-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the problem of efficiently evaluating the potential value of text corpora for large language models (LLMs). The authors propose a training-free, automated assessment method: multiple-choice questions are generated from the source texts, and the zero-shot performance gap between LLM predictions with and without the text in the prompt serves as a proxy for information gain. Unlike conventional data-evaluation paradigms that rely on model fine-tuning or retraining, this approach quantifies a corpus's value without any training at all. Empirical validation on three heterogeneous benchmarks (EPFL doctoral theses, Wikipedia articles, and synthetic data) shows that the method discriminates texts containing novel domain expertise from those covering already-learned knowledge, identifies high-potential corpora, and provides a scalable, low-cost basis for prioritizing data acquisition and preprocessing decisions.

📝 Abstract
As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
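The scoring idea described in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: `answer_mcq` calls to a real LLM are replaced by hypothetical prediction lists, and the score is simply the gap between open-book (text in prompt) and closed-book (no text) accuracy on the generated MCQs.

```python
# Illustrative sketch of the information-potential score.
# In the paper's pipeline, predictions would come from an LLM answering
# MCQs generated from the corpus; here they are hard-coded toy values.

def accuracy(predictions, gold):
    """Fraction of MCQ answers matching the gold choices."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def information_potential(closed_book_preds, open_book_preds, gold):
    """Open-book minus closed-book accuracy: the proxy for information gain.

    A large gap suggests the text contains knowledge the model lacks;
    a gap near zero suggests the content is already learned.
    """
    return accuracy(open_book_preds, gold) - accuracy(closed_book_preds, gold)

# Toy example: 4 questions with gold answers A-D.
gold = ["A", "B", "C", "D"]
closed = ["A", "C", "C", "A"]   # without the source text: 2/4 correct
opened = ["A", "B", "C", "D"]   # with the source text in the prompt: 4/4
print(information_potential(closed, opened, gold))  # 0.5
```

A high score would flag the collection as worth digitizing and integrating; near-zero scores (as expected for Wikipedia-like material already in training data) would deprioritize it.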
Problem

Research questions and friction points this paper is trying to address.

Automated evaluation of text collections
Measuring potential information gain
Prioritizing data for LLM integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for evaluation
Generates MCQs from texts
Measures LLM performance gap