🤖 AI Summary
The nuclear industry holds large volumes of highly sensitive unstructured text, yet little of it is usable for training and evaluating large language models (LLMs): curated data is scarce and privacy requirements are stringent.
Method: We propose the first synthetic question-answer (QA) pair generation framework tailored to the nuclear domain, jointly optimizing information fidelity and regulatory compliance. Our approach integrates LLM-driven information extraction, domain-adaptive question generation, multi-dimensional self-assessment of QA quality, and a nuclear-domain rule-guided post-processing module.
Contribution/Results: Evaluated on real nuclear documentation, the framework generates over 10,000 high signal-to-noise-ratio QA pairs. Downstream QA and knowledge retrieval tasks achieve significant accuracy improvements; the generated data has already been deployed to support multiple operational decision-assistance prototype systems.
📝 Abstract
The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable Q&A pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.
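The abstract names four steps: analyze text, extract key information, generate questions, and evaluate the resulting pairs. A minimal sketch of that loop follows, for illustration only and not the authors' implementation. Every name in it is hypothetical (`LLM`, `QAPair`, `extract_facts`, `generate_question`, `score_pair`, `build_qa_pairs`), as are the prompt wording and the `min_score` threshold of 0.7. The `stub_llm` function stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class QAPair:
    question: str
    answer: str
    score: float  # self-assessed quality in [0, 1]


def extract_facts(llm: LLM, passage: str) -> List[str]:
    """Ask the model for key facts, one per line, and strip bullet markers."""
    reply = llm(f"List the key facts in the passage, one per line.\n\n{passage}")
    return [ln.strip().lstrip("-• ") for ln in reply.splitlines() if ln.strip()]


def generate_question(llm: LLM, fact: str) -> str:
    """Ask the model for a question whose answer is the given fact."""
    return llm(f"Write one question answered by this fact.\n\nFact: {fact}").strip()


def score_pair(llm: LLM, passage: str, question: str, answer: str) -> float:
    """Ask the model to rate the pair; unparsable replies count as 0."""
    reply = llm(
        "Rate from 0 to 1 how well the passage supports the answer. "
        "Reply with a number only.\n\n"
        f"Passage: {passage}\nQuestion: {question}\nAnswer: {answer}"
    )
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.0


def build_qa_pairs(llm: LLM, passage: str, min_score: float = 0.7) -> List[QAPair]:
    """Extract facts, generate a question per fact, keep pairs above the threshold."""
    pairs = []
    for fact in extract_facts(llm, passage):
        question = generate_question(llm, fact)
        score = score_pair(llm, passage, question, fact)
        if score >= min_score:  # assumed cutoff, not taken from the paper
            pairs.append(QAPair(question, fact, score))
    return pairs


def stub_llm(prompt: str) -> str:
    """Canned replies standing in for a real model (assumption for the demo)."""
    if prompt.startswith("List the key facts"):
        return "- The pump is inspected every 12 months.\n- Valve V-1 is normally closed."
    if prompt.startswith("Write one question"):
        return "What does the document state about: " + prompt.split("Fact: ", 1)[1]
    if prompt.startswith("Rate from 0 to 1"):
        answer = prompt.rsplit("Answer: ", 1)[1]
        return "0.9" if "pump" in answer else "0.4"
    return ""


PASSAGE = "The pump is inspected every 12 months. Valve V-1 is normally closed."
pairs = build_qa_pairs(stub_llm, PASSAGE)
for pair in pairs:
    print(pair)
```

In practice `stub_llm` would be replaced by a call to whichever model is approved for the data in question. Because the abstract raises privacy concerns, that choice presumably matters as much as the prompts themselves.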