Neuronal Activation States as Sample Embeddings for Data Selection in Task-Specific Instruction Tuning

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of training data selection in task-specific instruction fine-tuning, this paper proposes, for the first time, a cognitive-mechanism-driven data selection paradigm inspired by neural co-activation in the human brain. Specifically, it leverages intermediate-layer neuron activation states of large language models (LLMs) as sample embeddings. High-dimensional activation vectors are extracted and subsequently employed in similarity-based scoring and submodular optimization to enable efficient, principled data filtering. Extensive experiments across multiple LLMs (LLaMA-2, Qwen), diverse NLP tasks (NLI, RE, QA), and varying selection ratios (10%–50%) demonstrate consistent superiority over baseline methods: instruction fine-tuning accuracy improves by 2.1–4.7 percentage points on average, data utilization efficiency increases by over 3×, and the approach exhibits enhanced generalization and robustness.
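The summary above mentions similarity-based scoring followed by submodular optimization over activation embeddings. As a rough sketch of that selection step (not the authors' code: the facility-location objective and greedy maximizer below are a standard submodular choice I am assuming for illustration, and the "activation embeddings" are toy random vectors):

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between row vectors (activation embeddings)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def greedy_facility_location(sim, k):
    """Greedily pick k samples maximizing the facility-location objective
    f(S) = sum_i max_{j in S} sim[i, j] -- a common submodular surrogate for
    covering the dataset with representatives; the paper's exact objective
    may differ."""
    n = sim.shape[0]
    selected = []
    best_cover = np.zeros(n)  # current max similarity of each sample to the selected set
    for _ in range(k):
        # marginal gain of adding each candidate column j
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-pick
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# toy example: 6 "activation embeddings" forming two near-orthogonal clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (3, 4)) + [1, 0, 0, 0],
               rng.normal(0, 0.1, (3, 4)) + [0, 1, 0, 0]])
picked = greedy_facility_location(cosine_sim_matrix(X), k=2)
print(picked)  # greedy coverage picks a representative from each cluster
```

Because facility location is monotone submodular, this greedy loop carries the classic (1 - 1/e) approximation guarantee, which is presumably why submodular optimization is attractive for principled data filtering here.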

📝 Abstract
Task-specific instruction tuning enhances the performance of large language models (LLMs) on specialized tasks, yet efficiently selecting relevant data for this purpose remains a challenge. Inspired by neural coactivation in the human brain, we propose a novel data selection method called NAS, which leverages neuronal activation states as embeddings for samples in the feature space. Extensive experiments show that NAS outperforms classical data selection methods in terms of both effectiveness and robustness across different models, datasets, and selection ratios.
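To make the embedding idea concrete, here is a minimal sketch of turning one sample's intermediate-layer neuron activation states into a single vector. This uses a toy feed-forward layer and mean pooling as stand-ins; in practice the hidden states would come from an LLM's intermediate layers, and the pooling choice is my assumption, not necessarily NAS's:

```python
import numpy as np

def activation_embedding(tokens_hidden, W, b):
    """Pool a sample's per-token neuron activations into one embedding.

    tokens_hidden: (seq_len, d) hidden states for one sample (toy stand-in
    for an LLM intermediate layer's input); W, b: that layer's parameters.
    Returns the mean post-ReLU activation per neuron, i.e. which neurons
    the sample "lights up" on average.
    """
    acts = np.maximum(tokens_hidden @ W + b, 0.0)  # (seq_len, n_neurons) activation states
    return acts.mean(axis=0)                       # (n_neurons,) sample embedding

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))      # hidden dim 8 -> 16 neurons (toy sizes)
b = rng.normal(size=16)
sample = rng.normal(size=(5, 8))  # one sample: 5 tokens, hidden dim 8
emb = activation_embedding(sample, W, b)
print(emb.shape)  # (16,)
```

The resulting vectors can then feed any similarity-based scorer or coverage-based selector in place of the usual sentence-embedding features.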
Problem

Research questions and friction points this paper is trying to address.

How to select training data efficiently for task-specific instruction tuning
Whether intermediate-layer neuronal activation states can serve as sample embeddings
How to improve the performance and robustness of instruction-tuned LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses neuronal activation states as embeddings
Enhances data selection for task-specific tuning
Outperforms classical methods in effectiveness
Authors

Da Ma
Assistant Professor, School of Medicine, Wake Forest University

Gonghu Shang
X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China

Zhi Chen
ByteDance

Libo Qin
School of Computer Science and Engineering, Central South University

Yijie Luo
X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China

Lei Pan
Michigan Technological University

Shuai Fan
AISpeech Co., Ltd., Suzhou, China

Lu Chen
X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China

Kai Yu
X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, SJTU AI Institute, Shanghai Jiao Tong University, Shanghai, China