🤖 AI Summary
This work addresses two limitations of current VideoRAG research: existing benchmarks do not realistically expose retrieval errors, and prior methods apply one uniform configuration per query (a single modality at a fixed temporal granularity), ignoring heterogeneity among video chunks. The authors introduce V-RAGBench, a benchmark of query-evidence-answer triplets that enables decoupled evaluation of retrieval and generation, and CARVE, a method that selects a modality and granularity configuration for each video chunk rather than for each query. CARVE combines parallel retrieval, chunk-adaptive reranking, and interleaved evidence fusion, so that the configuration chosen for a chunk during retrieval carries over to the generation stage. In experiments on V-RAGBench, CARVE outperforms eight recent VideoRAG baselines, and the evidence passed to the generator mixes multiple configurations, which query-level methods cannot do.
📝 Abstract
Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
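The chunk-level selection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Config`, `Hit`, `retrieve_all_configs`, `select_winning_configs`, and the `rerank_score` callback are hypothetical, the retrievers are assumed to return `(chunk_id, score)` pairs sorted by descending score, and all configurations are assumed to share one chunk index, which simplifies how different temporal granularities would be aligned in practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Config:
    """One retrieval configuration (hypothetical labels, not from the paper)."""
    modality: str
    granularity: str


@dataclass(frozen=True)
class Hit:
    """A candidate chunk returned by the retriever for one configuration."""
    chunk_id: int
    config: Config
    score: float


# Assumed retriever interface: query -> [(chunk_id, score), ...], best first.
Retriever = Callable[[str], List[Tuple[int, float]]]


def retrieve_all_configs(query: str,
                         retrievers: Dict[Config, Retriever],
                         top_k: int) -> List[Hit]:
    """Query every per-configuration retriever and pool the candidates.

    The retrievers are independent, so they could run in parallel;
    this sketch calls them one after another for simplicity.
    """
    hits: List[Hit] = []
    for config, retriever in retrievers.items():
        for chunk_id, score in retriever(query)[:top_k]:
            hits.append(Hit(chunk_id, config, score))
    return hits


def select_winning_configs(query: str,
                           hits: List[Hit],
                           rerank_score: Callable[[str, Hit], float],
                           top_n: int) -> List[Tuple[int, Config]]:
    """Keep, for each chunk, the configuration with the highest rerank score.

    Returns the top_n chunks in chunk order, each paired with its own
    winning configuration, so the evidence list can mix configurations.
    """
    best: Dict[int, Tuple[float, Config]] = {}
    for hit in hits:
        s = rerank_score(query, hit)
        if hit.chunk_id not in best or s > best[hit.chunk_id][0]:
            best[hit.chunk_id] = (s, hit.config)
    # Rank chunks by their winning score, then restore chunk order.
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)[:top_n]
    return sorted(((cid, cfg) for cid, (_, cfg) in ranked), key=lambda x: x[0])
```

Under these assumptions, a query-level method would correspond to forcing every returned chunk to share one `Config`, whereas here each chunk keeps whichever configuration scored best for it, and that pairing is what would be handed to the generator.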