How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation

📅 2026-06-10

🤖 AI Summary

This work addresses the lack of empirical guidance on how finely to vary question dimensions when building benchmarks for Retrieval-Augmented Generation (RAG) systems. The paper proposes HieraRAG, a framework that studies the effect of granularity on benchmark construction along three dimensions: question complexity, answer type, and linguistic variation. Optimal granularity is defined as the level that maximizes discriminative power, measured as the standard deviation of generation quality across categories. The framework contributes a portable hierarchical granularity selection procedure and a Coherence Ratio metric. Using FineWeb-10BT, the authors synthesize 5,872 question-answer pairs and evaluate them with a BM25 retriever and a Falcon-3-10B generator; human evaluation of a stratified sample confirms data quality. In this configuration, fine-grained question complexity yields the highest discriminative power (0.053), while medium granularity works best for answer type and linguistic variation. These findings reflect a single RAG configuration.

📝 Abstract

Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
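
The selection rule described in the abstract (pick the granularity level whose categories show the largest standard deviation in generation quality) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of the population standard deviation over per-category mean scores, and the toy scores are all assumptions made for illustration only.

```python
from statistics import pstdev

def discriminative_power(category_scores):
    """Standard deviation of mean generation quality across categories.

    Assumption: each category is summarized by its mean score, and the
    population standard deviation is used; the paper may differ on both.
    """
    means = [sum(scores) / len(scores) for scores in category_scores.values()]
    return pstdev(means)

def select_granularity(levels):
    """Return the level with the highest discriminative power, plus all powers."""
    powers = {level: discriminative_power(cats) for level, cats in levels.items()}
    best = max(powers, key=powers.get)
    return best, powers

# Hypothetical toy scores for one dimension at 2, 4, and 8 categories.
# These numbers are invented for illustration and are not results from the paper.
toy_levels = {
    2: {"a": [0.6, 0.6], "b": [0.5, 0.5]},
    4: {"a": [0.7, 0.7], "b": [0.6, 0.6], "c": [0.5, 0.5], "d": [0.4, 0.4]},
    8: {name: [score, score]
        for name, score in zip("abcdefgh", [0.56, 0.54] * 4)},
}

best_level, powers = select_granularity(toy_levels)
print(best_level, powers)
```

In practice the per-category scores would come from running a given retriever and generator over the benchmark, and the procedure would be repeated separately for each dimension.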

Problem

RAG benchmark

granularity

question generation

evaluation framework

discriminative power

Innovation

RAG benchmarking

hierarchical granularity

synthetic question generation