🤖 AI Summary
This study addresses the problem of efficiently evaluating the potential value of textual corpora for large language models (LLMs). We propose a training-free, automated assessment method: multiple-choice questions are generated from the source texts, and the zero-shot performance gap between LLM predictions with and without the text in the prompt serves as a proxy for information gain. This approach departs from conventional data-evaluation paradigms that rely on model fine-tuning or retraining, enabling corpus value to be quantified before any training takes place. Empirical validation on three heterogeneous benchmarks (EPFL doctoral theses, Wikipedia articles, and synthetic data) shows that the method discriminates texts containing novel domain expertise from those covering already-learned knowledge, identifies high-potential corpora, and provides a scalable, low-cost basis for prioritizing data-acquisition and preprocessing decisions.
📝 Abstract
As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple-choice questions (MCQs) from texts and measures an LLM's performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection's information potential. We validate our approach using three strategically selected datasets: EPFL PhD manuscripts (likely containing novel specialized knowledge), Wikipedia articles (presumably part of training data), and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.
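The core measurement of the pipeline can be sketched in a few lines. Everything below (the `MCQ` container, `information_gain`, the stub answer function) is illustrative and not the authors' implementation; a real pipeline would generate the MCQs with an LLM and replace the stub with actual model calls:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MCQ:
    """A multiple-choice question generated from a source text."""
    question: str
    choices: List[str]
    answer_idx: int  # index of the correct choice

def information_gain(
    mcqs: List[MCQ],
    answer_fn: Callable[[MCQ, Optional[str]], int],
    context: str,
) -> float:
    """Accuracy gap (with context minus without): a proxy for how much
    novel information the source text carries for the model."""
    def accuracy(ctx: Optional[str]) -> float:
        hits = sum(answer_fn(q, ctx) == q.answer_idx for q in mcqs)
        return hits / len(mcqs)
    return accuracy(context) - accuracy(None)

# Stub standing in for a real LLM call: it answers correctly only when
# the source text is supplied in the prompt (simulating novel knowledge).
def stub_answer_fn(q: MCQ, ctx: Optional[str]) -> int:
    return q.answer_idx if ctx is not None else 0

questions = [MCQ(f"q{i}", ["a", "b"], idx) for i, idx in enumerate([1, 0, 1, 1])]
gain = information_gain(questions, stub_answer_fn, context="some source text")
print(gain)  # 1.0 with context - 0.25 without = 0.75
```

A gain near zero would indicate the model already knows the material (the Wikipedia case), while a large positive gap flags a collection worth digitizing and integrating (the PhD-manuscript case).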