Unlocking the Potential of Large Language Models in the Nuclear Industry with Synthetic Data

📅 2025-06-10
🤖 AI Summary
Problem: The nuclear industry holds vast quantities of highly sensitive unstructured text in a low-resource domain, and data scarcity together with stringent privacy requirements severely constrains the training and evaluation of large language models (LLMs). Method: We propose the first synthetic question-answer (QA) pair generation framework tailored to the nuclear domain, jointly optimizing information fidelity and regulatory compliance. Our approach integrates LLM-driven information extraction, domain-adaptive question generation, multi-dimensional self-assessment of QA quality, and a nuclear-domain rule-guided post-processing module. Contribution/Results: Evaluated on real nuclear documentation, the framework generates over 10,000 QA pairs with a high signal-to-noise ratio. Downstream QA and knowledge retrieval tasks achieve significant accuracy improvements, and the generated data has already been deployed to support multiple operational decision-assistance prototype systems.

📝 Abstract
The nuclear industry possesses a wealth of valuable information locked away in unstructured text data. This data, however, is not readily usable for advanced Large Language Model (LLM) applications that require clean, structured question-answer pairs for tasks like model training, fine-tuning, and evaluation. This paper explores how synthetic data generation can bridge this gap, enabling the development of robust LLMs for the nuclear domain. We discuss the challenges of data scarcity and privacy concerns inherent in the nuclear industry and how synthetic data provides a solution by transforming existing text data into usable Q&A pairs. This approach leverages LLMs to analyze text, extract key information, generate relevant questions, and evaluate the quality of the resulting synthetic dataset. By unlocking the potential of LLMs in the nuclear industry, synthetic data can pave the way for improved information retrieval, enhanced knowledge sharing, and more informed decision-making in this critical sector.
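The steps named in the abstract (analyze text, extract key information, generate a question) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the prompt wording, the `QAPair` structure, and the paragraph-based chunking are all hypothetical, and `stub_llm` is an offline stand-in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class QAPair:
    question: str
    answer: str
    source: str  # the text chunk the pair was derived from


def chunk_text(text: str, max_chars: int = 500) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def generate_qa(chunk: str, llm: LLM) -> QAPair:
    """Extract one key fact from the chunk, then ask for a matching question."""
    fact = llm(f"State the single most important fact in this text:\n{chunk}").strip()
    question = llm(f"Write one question whose answer is this fact:\n{fact}").strip()
    return QAPair(question=question, answer=fact, source=chunk)


def build_dataset(text: str, llm: LLM, max_chars: int = 500) -> List[QAPair]:
    """Turn raw text into a list of synthetic Q&A pairs, one per chunk."""
    return [generate_qa(chunk, llm) for chunk in chunk_text(text, max_chars)]


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a real model (assumption for demonstration only)."""
    instruction, _, payload = prompt.partition("\n")
    if instruction.startswith("State"):
        # Pretend the first sentence is the key fact.
        return payload.split(".")[0].strip() + "."
    return f"What does the document state about: {payload.rstrip('.')}?"
```

Calling `build_dataset(document_text, stub_llm)` returns one `QAPair` per chunk. Keeping the model behind a plain callable means the same code could run against an on-premises model, which matters when the source text cannot leave a controlled environment.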
Problem

Research questions and friction points this paper is trying to address.

Bridging unstructured nuclear text data to structured LLM inputs
Addressing data scarcity and privacy in nuclear LLM applications
Enabling nuclear knowledge sharing via synthetic Q&A generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation for LLM training
Transforming text into Q&A pairs automatically
Leveraging LLMs to evaluate synthetic data
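One way to picture LLM-based evaluation of synthetic Q&A pairs is a judge model that scores each pair on a few dimensions, with low-scoring pairs filtered out. The sketch below assumes a hypothetical three-dimension rubric (`relevance`, `faithfulness`, `clarity`), a 1 to 5 scale, and a keep-threshold of 4; none of these come from the paper, and `stub_judge` is a toy stand-in for a real model.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Assumed rubric for illustration; the paper does not list its dimensions here.
DIMENSIONS = ("relevance", "faithfulness", "clarity")

# (question, answer, source text)
Pair = Tuple[str, str, str]


def score_pair(question: str, answer: str, source: str, judge: LLM) -> Dict[str, int]:
    """Ask the judge for a 1-5 score per dimension; unparseable replies score 0."""
    scores: Dict[str, int] = {}
    for dim in DIMENSIONS:
        reply = judge(
            f"Rate the {dim} of this Q&A pair from 1 to 5. Reply with one digit.\n"
            f"Source: {source}\nQ: {question}\nA: {answer}"
        )
        match = re.search(r"[1-5]", reply)
        scores[dim] = int(match.group()) if match else 0
    return scores


def filter_pairs(pairs: List[Pair], judge: LLM, min_score: int = 4) -> List[Pair]:
    """Keep only pairs whose lowest dimension score reaches min_score."""
    kept: List[Pair] = []
    for question, answer, source in pairs:
        if min(score_pair(question, answer, source, judge).values()) >= min_score:
            kept.append((question, answer, source))
    return kept


def stub_judge(prompt: str) -> str:
    """Toy judge (assumption): high score only if the answer text appears in
    the source. Expects the single-line source used by score_pair's prompt."""
    lines = prompt.splitlines()
    source = lines[1][len("Source: "):]
    answer = lines[3][len("A: "):]
    return "5" if answer in source else "2"
```

Filtering on the minimum score rather than the average is a deliberately conservative choice: a pair that is fluent but unfaithful to its source is dropped instead of being rescued by its other scores.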
Muhammad Anwar
Data Analytics and AI, Digital Technology and Services, Ontario Power Generation, Pickering, Ontario, Canada
Daniel Lau
Databeam Professor of Electrical and Computer Engineering, University of Kentucky
signal and image processing
Mishca de Costa
Data Analytics and AI, Digital Technology and Services, Ontario Power Generation, Pickering, Ontario, Canada
Issam Hammad
Department of Engineering Mathematics and Internetworking, Faculty of Engineering, Dalhousie University, Halifax, Nova Scotia