AI Summary
To address low retrieval efficiency, high redundancy, and poor traceability in spectral analysis knowledge discovery, this paper proposes a trustworthy question-answering system tailored to the domain. Methodologically: (1) we release SDAAP, the first open-source textual knowledge dataset specifically for spectral analysis; (2) we introduce a knowledge-traceable Retrieval-Augmented Generation (RAG) framework that constrains large language models (LLMs) to general-purpose generation while enabling precise, anchor-based retrieval through joint entity recognition and a domain-specific knowledge graph; (3) we enhance domain expertise via fine-tuning and prompt engineering. Experimental results demonstrate that our system significantly outperforms baseline models in answer accuracy, domain specificity, and source traceability: every generated answer is precisely grounded in original literature passages, thereby enabling efficient and reliable scientific knowledge acquisition.
Abstract
Large language models (LLMs) have demonstrated significant success in a range of natural language processing (NLP) tasks in the general domain. Their emergence has introduced innovative methodologies across diverse fields, including the natural sciences. Researchers aim to implement automated, concurrent processes driven by LLMs to replace conventional manual, repetitive, and labor-intensive work. In the domain of spectral analysis and detection, researchers must independently acquire pertinent knowledge across various research objects, including the spectroscopic techniques and chemometric methods employed in experiments and analysis. Paradoxically, although spectroscopic detection is recognized as an effective analytical method, the fundamental process of knowledge retrieval remains both time-intensive and repetitive. In response to this challenge, we first introduce the Spectral Detection and Analysis Based Paper (SDAAP) dataset, the first open-source textual knowledge dataset for spectral analysis and detection, which contains annotated literature data as well as corresponding knowledge instruction data. We then design an automated Q&A framework based on the SDAAP dataset, which retrieves relevant knowledge and generates high-quality responses by extracting entities from the input as retrieval parameters. Notably, within this framework the LLM is used only as a tool to provide generalizability, while the retrieval-augmented generation (RAG) technique is used to accurately capture the source of the knowledge. This approach not only improves the quality of the generated responses but also ensures the traceability of the knowledge. Experimental results show that our framework generates responses with more reliable expertise than the baseline.
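The entity-anchored retrieval idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Passage` record, the two toy corpus entries, the source keys, and the dictionary-based `extract_entities` function are all hypothetical stand-ins (the paper uses a trained entity recognizer and the SDAAP annotations). The sketch only shows how entities found in a question can act as retrieval parameters, and how each retrieved passage can carry its source key into the prompt so the answer stays traceable.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str          # hypothetical citation key of the originating paper
    entities: frozenset  # annotated entities (research object, technique, method)
    text: str


# Toy stand-in for annotated literature records; the contents are invented.
CORPUS = [
    Passage("paper-001", frozenset({"milk", "nir spectroscopy", "pls"}),
            "NIR spectroscopy with PLS regression was used to analyse milk."),
    Passage("paper-002", frozenset({"olive oil", "raman spectroscopy", "pca"}),
            "Raman spectroscopy with PCA was used to screen olive oil."),
]

# Every annotated entity doubles as a lookup term.
VOCABULARY = {entity for passage in CORPUS for entity in passage.entities}


def extract_entities(question: str) -> set:
    """Whole-word dictionary lookup, standing in for a trained entity recognizer."""
    lowered = question.lower()
    return {term for term in VOCABULARY
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)}


def retrieve(question: str, top_k: int = 3) -> list:
    """Rank passages by how many question entities they share; drop non-matches."""
    found = extract_entities(question)
    scored = [(len(found & passage.entities), passage) for passage in CORPUS]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in hits[:top_k]]


def build_prompt(question: str, passages: list) -> str:
    """Prefix each passage with its source key so the answer can cite it."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return ("Answer using only the passages below and cite their keys.\n"
            f"{context}\nQuestion: {question}")


question = "Which chemometric methods are used with NIR spectroscopy for milk?"
prompt = build_prompt(question, retrieve(question))
```

In this sketch the language model would only see `prompt`, so its role is limited to phrasing an answer from passages that already carry their source keys, which mirrors the division of labor the abstract describes.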