🤖 AI Summary
The utility, typology, and synergistic integration of self-generated documentation (Self-Docs) from large language models (LLMs) in retrieval-augmented generation (RAG) remain poorly understood, particularly for knowledge-intensive question answering. Method: This work establishes, for the first time, an interpretable Self-Docs typology grounded in Systemic Functional Linguistics and designs a multi-dimensional RAG evaluation framework rigorously tested on Natural Questions and TriviaQA. Contribution/Results: Empirical analysis reveals substantial heterogeneity in performance gains across linguistically defined Self-Docs categories; notably, hybrid usage strategies (e.g., combining "definition" and "exemplification" Self-Docs) consistently outperform standard RAG baselines. The study provides both theoretical foundations and practical guidelines for controllable Self-Docs generation, semantics-aware filtering, and effective coordination with externally retrieved documents.
📝 Abstract
The integration of documents generated by LLMs themselves (Self-Docs) alongside retrieved documents has emerged as a promising strategy for retrieval-augmented generation systems. However, previous research has primarily focused on optimizing the use of Self-Docs, while their inherent properties remain underexplored. To bridge this gap, we first investigate the overall effectiveness of Self-Docs, identifying key factors that shape their contribution to RAG performance (RQ1). Building on these insights, we develop a taxonomy grounded in Systemic Functional Linguistics to compare the influence of various Self-Docs categories (RQ2) and explore strategies for combining them with external sources (RQ3). Our findings reveal which types of Self-Docs are most beneficial and offer practical guidelines for leveraging them to achieve significant improvements in knowledge-intensive question answering tasks.
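The hybrid strategy described above (prepending typed Self-Docs to externally retrieved evidence before answering) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual pipeline: the prompt templates, the `generate_self_doc` and `build_rag_prompt` helpers, and the two example types ("definition", "exemplification") are all hypothetical names chosen for clarity.

```python
# Hypothetical sketch of a hybrid Self-Docs + retrieval RAG prompt.
# All function names and prompt templates are illustrative assumptions,
# not the paper's actual implementation.

SELF_DOC_PROMPTS = {
    "definition": "Define the key concept in this question: {q}",
    "exemplification": "Give a concrete example relevant to this question: {q}",
}

def generate_self_doc(llm, question: str, doc_type: str) -> str:
    """Ask the LLM to produce one Self-Doc of the requested linguistic type."""
    return llm(SELF_DOC_PROMPTS[doc_type].format(q=question))

def build_rag_prompt(llm, question: str, retrieved_docs: list[str],
                     self_doc_types=("definition", "exemplification")) -> str:
    """Hybrid strategy: typed Self-Docs are concatenated with retrieved docs."""
    self_docs = [generate_self_doc(llm, question, t) for t in self_doc_types]
    context = "\n\n".join(self_docs + retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Stub LLM, for illustration only; a real system would call an actual model.
fake_llm = lambda prompt: f"[LLM output for: {prompt}]"

prompt = build_rag_prompt(
    fake_llm,
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy by William Shakespeare."],
)
```

Under this sketch, the final answer-generation call would receive two self-generated passages plus the retrieved passage as context; semantics-aware filtering (RQ3) would decide which Self-Doc types to include per question.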