🤖 AI Summary
Graphene CVD synthesis suffers from scarce experimental data, heterogeneous literature reports (inconsistent formats and variable quality), and mixed continuous-discrete features, posing severe challenges for small-sample modeling. Method: We propose an LLM-driven feature homogenization framework: (i) missing-value imputation via prompt-based completion; (ii) semantic embedding encoding of substrate terminology; (iii) GPT-4 prompt engineering, SVM classification, and multi-strategy data augmentation. Contribution/Results: Our method maps heterogeneous literature data into a unified feature space, enabling robust small-sample learning. On layer-number prediction, binary classification accuracy improves from 39% to 65%, and ternary classification accuracy from 52% to 72%, substantially outperforming direct LLM fine-tuning. To our knowledge, this is the first work applying LLMs to small-sample materials feature engineering (rather than end-to-end prediction), establishing a transferable methodology for data-scarce materials discovery.
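The prompt-based imputation step (i) can be sketched as follows. This is a minimal illustration, not the paper's actual prompt template: the helper names, the record fields, and the stubbed LLM callable are all hypothetical, and a real pipeline would pass the prompt to an LLM API.

```python
# Sketch of prompt-based missing-value imputation for a CVD synthesis record.
# Helper names and the prompt wording are illustrative assumptions, not the
# paper's exact template.

def build_imputation_prompt(record, missing_field):
    """Format the known fields of a synthesis record into a completion prompt."""
    known = "; ".join(f"{k} = {v}" for k, v in record.items() if v is not None)
    return (
        "Given a graphene CVD synthesis experiment with "
        f"{known}, estimate the most likely value of '{missing_field}'. "
        "Answer with a single value only."
    )

def impute(record, missing_field, llm=None):
    """Fill one missing field; `llm` is any callable mapping prompt -> text."""
    prompt = build_imputation_prompt(record, missing_field)
    if llm is None:
        # No model attached: leave the gap rather than guess.
        return None
    return llm(prompt).strip()

# Usage with a stubbed "LLM" so the sketch runs offline:
record = {"substrate": "Cu foil", "temperature_C": 1035, "pressure_Torr": None}
prompt = build_imputation_prompt(record, "pressure_Torr")
value = impute(record, "pressure_Torr", llm=lambda p: " 0.5 ")
```

The point of the design is that imputation becomes a text-completion task conditioned on the known fields, so the same mechanism works for both continuous and categorical gaps.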
📝 Abstract
Machine learning in materials science faces challenges due to limited experimental data, as generating synthesis data is costly and time-consuming, especially with in-house experiments. Mining data from the existing literature introduces further issues: mixed data quality, inconsistent formats, and variation in how experimental parameters are reported, all of which complicate the construction of consistent features for the learning algorithm. Additionally, combining continuous and discrete features can hinder learning when data are limited. Here, we propose strategies that utilize large language models (LLMs) to enhance machine learning performance on a limited, heterogeneous dataset of graphene chemical vapor deposition (CVD) synthesis experiments compiled from the literature. These strategies include prompt-based imputation of missing data points and LLM embeddings that encode the complex nomenclature of substrates reported in CVD experiments. The proposed strategies enhance graphene layer classification with a support vector machine (SVM) model, increasing binary classification accuracy from 39% to 65% and ternary accuracy from 52% to 72%. We compare the performance of the SVM and a GPT-4 model, with the SVM trained, and the GPT-4 model fine-tuned, on the same data. Our results demonstrate that the numerical classifier, when combined with LLM-driven data enhancements, outperforms the standalone LLM predictor, highlighting that in data-scarce scenarios, improving predictive learning with LLM strategies requires more than simple fine-tuning: it demands deliberate data imputation and feature-space homogenization. The proposed strategies emphasize data enhancement techniques, offering a broadly applicable framework for improving machine learning performance on scarce, inhomogeneous datasets.
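The embedding-based encoding of substrate nomenclature can be sketched as below. The vectors here are made-up toy values standing in for real LLM text embeddings (which would come from an embedding API), and the substrate names and feature scaling are illustrative assumptions; the sketch only shows how free-text substrate labels become numeric features that a classifier such as an SVM can consume alongside continuous parameters.

```python
import math

# Toy stand-in for LLM text embeddings: in the actual pipeline each substrate
# string would be embedded by a language model. These 3-d vectors are invented
# purely to illustrate the encoding step (real embeddings have hundreds of dims).
SUBSTRATE_EMBEDDINGS = {
    "Cu foil":                [0.9, 0.1, 0.0],
    "Cu(111) single crystal": [0.8, 0.2, 0.1],
    "Ni film on SiO2":        [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def encode_substrate(name):
    """Map a substrate string to its embedding vector (the unified feature space)."""
    return SUBSTRATE_EMBEDDINGS[name]

def feature_vector(substrate, temperature_c, pressure_torr):
    """Concatenate the substrate embedding with roughly scaled continuous features."""
    return encode_substrate(substrate) + [temperature_c / 1100.0,
                                          pressure_torr / 760.0]
```

Because semantically related substrate descriptions ("Cu foil" vs. "Cu(111) single crystal") receive nearby embedding vectors, the classifier can generalize across the inconsistent substrate terminology found in the literature instead of treating each spelling as an unrelated one-hot category.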