🤖 AI Summary
This work addresses the lack of empirical guidance on which question dimensions to vary, and at what granularity, when building benchmarks for Retrieval-Augmented Generation (RAG) systems. We propose HieraRAG, a hierarchical framework that studies granularity in benchmark construction along three dimensions (question complexity, answer type, and linguistic variation), each at three granularity levels (2, 4, and 8 categories). Optimal granularity is defined as the level that maximizes discriminative power, measured as the standard deviation of generation quality across categories within a given RAG configuration. We also introduce a Coherence Ratio metric that quantifies whether fine-grained splits cleanly subdivide their parent categories. As a case study, we synthesize 5,872 question-answer pairs from FineWeb-10BT and evaluate a pipeline that uses BM25 for retrieval and Falcon-3-10B for generation; human evaluation of 110 stratified pairs confirms the quality of the synthetic data. Question complexity is best discriminated at fine granularity (discriminative power: 0.053), while answer type and linguistic variation peak at medium granularity. These results reflect a single configuration; the selection procedure and the Coherence Ratio are intended to transfer to other RAG settings.
📝 Abstract
Evaluating retrieval-augmented generation (RAG) systems requires benchmarks that capture diverse question characteristics, yet practitioners lack empirical guidance on which dimensions to vary and at what granularity. We present HieraRAG, a hierarchical framework for studying granularity in RAG benchmark construction, defining optimal granularity as the level that maximizes discriminative power (the standard deviation of generation quality across categories) within a given RAG configuration. As a case study, we generate 5,872 synthetic question-answer (QA) pairs from FineWeb-10BT across 3 dimensions (Question Complexity, Answer Type, Linguistic Variation) at 3 granularity levels (2, 4, and 8 categories). With a BM25+Falcon-3-10B pipeline, optimal granularity varies by dimension: complexity benefits from fine-grained distinctions (discriminative power: 0.053) while answer type and linguistic variation peak at medium granularity. We introduce a Coherence Ratio metric to quantify whether fine-grained splits cleanly subdivide parent categories, revealing structural differences across dimensions (Question Complexity: 0.40 vs. Answer Type: 1.44). Human evaluation of 110 stratified QA pairs confirms synthetic quality. While these specific findings reflect a single configuration, HieraRAG provides a portable procedure and validation metric for practitioners to determine evaluation granularity within their own RAG settings.
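The selection rule described in the abstract (choose the granularity level whose categories show the largest standard deviation in generation quality) can be illustrated with a minimal Python sketch. The function names, category labels, and scores below are hypothetical and are not taken from the paper. The sketch also assumes the population standard deviation over per-category mean scores, which the abstract does not specify. The Coherence Ratio is not sketched because its formula is not given here.

```python
from statistics import pstdev


def discriminative_power(category_scores):
    """Spread of generation quality across the categories of one granularity level.

    `category_scores` maps a category name to its mean generation-quality score.
    Assumption: population standard deviation; the paper may use a different estimator.
    """
    return pstdev(list(category_scores.values()))


def select_granularity(scores_by_level):
    """Return (best_level, powers), where best_level maximizes discriminative power."""
    powers = {
        level: discriminative_power(categories)
        for level, categories in scores_by_level.items()
    }
    best_level = max(powers, key=powers.get)
    return best_level, powers


# Hypothetical per-category mean scores for one dimension at 2, 4, and 8 categories.
example_scores = {
    2: {"c1": 0.62, "c2": 0.58},
    4: {"c1": 0.70, "c2": 0.60, "c3": 0.55, "c4": 0.55},
    8: {"c1": 0.61, "c2": 0.59, "c3": 0.60, "c4": 0.60,
        "c5": 0.62, "c6": 0.58, "c7": 0.60, "c8": 0.60},
}

best_level, powers = select_granularity(example_scores)
print(best_level, powers)
```

Running the same procedure separately for each dimension, under one's own retriever and generator, would give the per-dimension granularity choice; with these made-up numbers the 4-category level is selected because its categories separate most in quality.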