🤖 AI Summary
This study addresses the problem of efficiently evaluating the potential value of textual corpora for large language models (LLMs). We propose a training-free, automated assessment method: multiple-choice questions are generated from the source texts, and the zero-shot performance gap between LLM predictions with and without the text in the prompt serves as a proxy for information gain. This approach departs from conventional data-evaluation paradigms that rely on model fine-tuning or retraining, enabling corpus value to be quantified before any training takes place. Empirical validation on three heterogeneous benchmarks (EPFL doctoral theses, Wikipedia articles, and synthetic data) shows that the method discriminates texts containing novel domain expertise from those covering already-learned knowledge, identifies high-potential corpora, and provides a scalable, low-cost basis for prioritizing data-acquisition and preprocessing decisions.
📝 Abstract
As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple-choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
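The core measurement of the pipeline can be sketched in a few lines. Everything below (the `MCQ` container, `information_gain`, the stub answer function) is illustrative and not the authors' implementation; a real pipeline would generate the MCQs with an LLM and replace the stub with actual model calls:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MCQ:
    """A multiple-choice question generated from a source text."""
    question: str
    choices: List[str]
    answer_idx: int  # index of the correct choice

def information_gain(
    mcqs: List[MCQ],
    answer_fn: Callable[[MCQ, Optional[str]], int],
    context: str,
) -> float:
    """Accuracy gap (with context minus without): a proxy for how much
    novel information the source text carries for the model."""
    def accuracy(ctx: Optional[str]) -> float:
        hits = sum(answer_fn(q, ctx) == q.answer_idx for q in mcqs)
        return hits / len(mcqs)
    return accuracy(context) - accuracy(None)

# Stub standing in for a real LLM call: it answers correctly only when
# the source text is supplied in the prompt (simulating novel knowledge).
def stub_answer_fn(q: MCQ, ctx: Optional[str]) -> int:
    return q.answer_idx if ctx is not None else 0

questions = [MCQ(f"q{i}", ["a", "b"], idx) for i, idx in enumerate([1, 0, 1, 1])]
gain = information_gain(questions, stub_answer_fn, context="some source text")
print(gain)  # 1.0 with context - 0.25 without = 0.75
```

A gain near zero would indicate the model already knows the material (the Wikipedia case), while a large positive gap flags a collection worth digitizing and integrating (the PhD-manuscript case).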