FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

📅 2025-04-17
🤖 AI Summary
This work addresses the lack of realistic, scalable, and contamination-free benchmarks for information retrieval (IR) and retrieval-augmented generation (RAG) over technical documentation. The authors propose FreshStack, an automated framework for constructing nugget-level IR benchmarks. Methodologically, it automatically collects corpora from code and technical documentation, generates fine-grained information nuggets from community-asked questions and answers, and uses a fusion of retrieval techniques and hybrid architectures (BM25, embedding-based retrieval, and cross-encoders) to support nugget-level relevance annotation. Contributions include: (1) five challenging, domain-specific benchmark datasets covering fast-growing, recent, and niche technical topics; (2) evidence that state-of-the-art retrieval models, applied out-of-the-box, fall substantially short of oracle approaches on all five topics; and (3) the finding that rerankers do not clearly improve first-stage retrieval accuracy on two of the five topics, revealing real-world bottlenecks in retrieval pipelines.

📝 Abstract
We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: https://fresh-stack.github.io.
Problem

Research questions and friction points this paper aims to address.

Automate building IR benchmarks from Q&A and docs
Evaluate retrieval models on niche, recent topics
Assess reranker impact on retrieval accuracy gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic corpus collection from technical docs
Nugget generation from Q&A pairs
Hybrid retrieval fusion for document support
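The hybrid retrieval fusion above combines rankings from heterogeneous first-stage retrievers (e.g. BM25 and an embedding model) into a single candidate list. The paper does not specify its exact fusion formula, so the sketch below uses reciprocal rank fusion (RRF), a common rank-based technique; the document IDs and the choice of RRF are illustrative assumptions.

```python
# Minimal sketch of reciprocal rank fusion (RRF), one common way to fuse
# ranked lists from different retrievers. NOTE: this is an assumed fusion
# method; FreshStack's exact fusion recipe may differ.

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one ranking.

    rankings: list of lists of doc IDs, each ordered best-first.
    k: smoothing constant; 60 is the value commonly used in practice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to a document's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical first-stage results from two retrievers.
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused[0])  # → d1 (ranked 2nd by BM25 and 1st by the dense model)
```

Rank-based fusion like this sidesteps the need to calibrate incomparable scores (BM25 scores vs. cosine similarities), which is why it is a popular default when mixing sparse and dense retrieval.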