🤖 AI Summary
To address the inefficiency of retrieval-augmented language models (RALMs) in long-context reasoning—specifically, the KV cache invalidation and redundant computation caused by conventional prefix-concatenation of retrieved content—this paper proposes FlashBack, a modular RALM framework. Its core innovation is a “suffix-appending + marking token” context fusion paradigm: retrieved documents are appended to the end of the input sequence, and learnable Marking Tokens are introduced to explicitly delineate the boundary between the query and the retrieved content. This design enables full-path KV cache reuse across the entire RAG pipeline, eliminating redundant key-value computations. Combined with LoRA fine-tuning and KV cache optimization, FlashBack achieves up to a 4× inference speedup over standard prefix-based RALMs on Llama-2-7B while maintaining comparable perplexity, significantly reducing latency and computational overhead for long-context RAG.
📝 Abstract
Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents retrieved from an external corpus, is a proven method for enabling LLMs to generate information beyond the scope of their pre-training corpus. Previous work uses the retrieved content by simply prepending it to the input, which incurs high runtime cost and degrades inference efficiency because the Key-Value (KV) cache cannot be reused effectively. In this paper, we propose FlashBack, a modular RALM designed to improve inference efficiency through an appending-context pattern while maintaining decent performance after fine-tuning with Low-Rank Adaptation (LoRA). FlashBack appends retrieved documents at the end of the context, instead of prepending them, so that the KV cache can be utilized efficiently. We also introduce Marking Tokens, two special prompt tokens that mark the boundary of the appended context during fine-tuning. Our experiments on generation quality show that FlashBack maintains decent perplexity, and its inference speed is up to $4\times$ faster than the prepending counterpart on a 7B LLM (Llama 2) in the runtime test. By bypassing unnecessary re-computation, FlashBack achieves significantly faster inference, and this heightened efficiency substantially reduces inference cost.
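The efficiency argument above comes down to how much of a cached sequence survives when the retrieved documents change. The following minimal sketch illustrates this with a toy token-list model; the marking-token names and the longest-common-prefix cache rule are illustrative assumptions, not the paper's actual implementation:

```python
# Toy illustration: prepending vs. appending retrieved documents and
# the effect on KV cache reuse. Tokens are plain strings for clarity.

def build_prepend(context, docs):
    # Conventional RALM: retrieved docs come first, so when retrieval
    # results change, every context token shifts position and its
    # cached key-value entries are invalidated.
    return docs + context

def build_append(context, docs, marks=("<MARK_S>", "<MARK_E>")):
    # FlashBack-style layout: the context keeps its positions and the
    # retrieved docs, wrapped in (hypothetical) marking tokens, are
    # appended, so the context's KV cache entries stay valid.
    start, end = marks
    return context + [start] + docs + [end]

def reusable_prefix(old_seq, new_seq):
    # KV entries can be reused only for the longest common prefix of
    # the previously cached sequence and the new input sequence.
    n = 0
    for a, b in zip(old_seq, new_seq):
        if a != b:
            break
        n += 1
    return n

context = ["q1", "q2"]          # ongoing input / query tokens
docs_v1 = ["d1a", "d1b"]        # first retrieval result
docs_v2 = ["d2a"]               # second retrieval result

# Prepending: the sequences diverge at token 0 -> full re-computation.
print(reusable_prefix(build_prepend(context, docs_v1),
                      build_prepend(context, docs_v2)))  # 0

# Appending: context (+ boundary token) is a shared prefix -> reusable.
print(reusable_prefix(build_append(context, docs_v1),
                      build_append(context, docs_v2)))   # 3
```

Under this toy model, the prepending layout forfeits the entire cache whenever the retrieved documents change, while the appending layout preserves the cache for every context token, which is the source of the runtime gap the abstract reports.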