SPAR: Session-based Pipeline for Adaptive Retrieval on Legacy File Systems

📅 2025-12-14
🤖 AI Summary
Legacy enterprise file systems hold large amounts of unstructured historical data with no semantic indexing, which makes retrieval inefficient and error-prone and limits the data's value for decision support. SPAR (Session-based Pipeline for Adaptive Retrieval) is a conceptual framework that integrates large language models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture through a lightweight two-stage process: (1) a semantic Metadata Index is built over the file system, and (2) session-specific vector databases are generated on demand, which avoids building and maintaining a vector database that mirrors the entire file system. The design aims to reduce computational overhead while improving the transparency, controllability, and relevance of retrieval. A theoretical complexity analysis compares SPAR with standard LLM-based RAG pipelines and indicates a computational advantage. Applied to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, SPAR shows improvements in both retrieval effectiveness and downstream model accuracy.

📝 Abstract
The ability to extract value from historical data is essential for enterprise decision-making. However, much of this information remains inaccessible within large legacy file systems that lack structured organization and semantic indexing, making retrieval and analysis inefficient and error-prone. We introduce SPAR (Session-based Pipeline for Adaptive Retrieval), a conceptual framework that integrates Large Language Models (LLMs) into a Retrieval-Augmented Generation (RAG) architecture specifically designed for legacy enterprise environments. Unlike conventional RAG pipelines, which require costly construction and maintenance of full-scale vector databases that mirror the entire file system, SPAR employs a lightweight two-stage process: a semantic Metadata Index is first created, after which session-specific vector databases are dynamically generated on demand. This design reduces computational overhead while improving transparency, controllability, and relevance in retrieval. We provide a theoretical complexity analysis comparing SPAR with standard LLM-based RAG pipelines, demonstrating its computational advantages. To validate the framework, we apply SPAR to a synthesized enterprise-scale file system containing a large corpus of biomedical literature, showing improvements in both retrieval effectiveness and downstream model accuracy. Finally, we discuss design trade-offs and outline open challenges for deploying SPAR across diverse enterprise settings.
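
The two-stage idea described in the abstract can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the authors' implementation: the keyword sets stand in for the LLM-generated semantic Metadata Index, the bag-of-words `embed` function stands in for a real embedding model, and every function name and file path below is hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would rely on an LLM or embedding model.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_metadata_index(files):
    """Stage 1: one lightweight metadata record (a keyword set) per file."""
    return {path: set(embed(text)) for path, text in files.items()}

def build_session_store(files, index, session_topic):
    """Stage 2: vectorize only the files whose metadata overlaps the session topic."""
    topic = set(embed(session_topic))
    candidates = [path for path, keywords in index.items() if keywords & topic]
    return {path: embed(files[path]) for path in candidates}

def retrieve(store, query, top_n=1):
    """Rank the session's vectors against the query and return the best paths."""
    q = embed(query)
    ranked = sorted(store, key=lambda path: cosine(q, store[path]), reverse=True)
    return ranked[:top_n]

# Hypothetical miniature "legacy file system".
files = {
    "reports/q3_sales.txt": "Quarterly sales figures for the retail division",
    "papers/crispr_review.txt": "Review of CRISPR gene editing in cancer therapy",
    "papers/mrna_vaccine.txt": "mRNA vaccine delivery with lipid nanoparticles",
}

index = build_metadata_index(files)
store = build_session_store(files, index, "gene therapy and vaccine research")
hits = retrieve(store, "CRISPR editing for cancer")
print(sorted(store))  # only the session-relevant files were vectorized
print(hits)
```

In this sketch the sales report is never vectorized, because its metadata shares no terms with the session topic. That is the sense in which a session-specific store can be smaller than a vector database mirroring the whole file system.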
Problem

Research questions and friction points this paper is trying to address.

Efficiently retrieving unstructured data from legacy file systems
Reducing computational costs of retrieval-augmented generation in enterprises
Improving accuracy and relevance in historical data analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLMs into RAG for legacy file systems
Uses lightweight two-stage process with dynamic vector databases
Improves retrieval transparency, controllability, and relevance
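
A back-of-the-envelope cost model may help show when a two-stage process with on-demand vector databases can be cheaper than vectorizing everything up front. The model and all of its symbols are assumptions made for illustration; they are not taken from the paper's complexity analysis, and they ignore index maintenance, caching across sessions, and query-time costs.

```latex
% Illustrative cost model; every symbol below is an assumption, not the paper's notation.
% N   : number of files in the legacy file system
% c_m : per-file cost of creating a metadata record
% c_e : per-file cost of full vectorization (c_e > 0)
% S   : number of retrieval sessions
% k_s : number of files vectorized for session s
\begin{align*}
C_{\text{full}} &= N\,c_e \\
C_{\text{two-stage}} &= N\,c_m + \sum_{s=1}^{S} k_s\,c_e \\
C_{\text{two-stage}} < C_{\text{full}}
  &\iff \sum_{s=1}^{S} k_s < N\left(1 - \frac{c_m}{c_e}\right)
\end{align*}
```

Under these assumptions the two-stage approach pays off when metadata records are cheaper than full vectorization and the sessions together touch only a fraction of the file system.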
Authors

Duy A. Nguyen
PhD Candidate, CS @ UIUC
Machine Learning, Multimodal Learning, LLM

Hai H. Do
School of Communication and Information Technology, Hanoi University of Science and Technology

Minh Doan
Bioimaging Analytics, GlaxoSmithKline, Collegeville, PA, USA

Minh N. Do
Professor, University of Illinois at Urbana-Champaign and VinUniversity
signal processing, computational imaging, machine perception, data science