Neuronal Activation States as Sample Embeddings for Data Selection in Task-Specific Instruction Tuning

📅 2025-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the inefficiency of training data selection in task-specific instruction fine-tuning, this paper proposes, for the first time, a cognitive-mechanism-driven data selection paradigm inspired by neural co-activation in the human brain. Specifically, it leverages intermediate-layer neuron activation states of large language models (LLMs) as sample embeddings. High-dimensional activation vectors are extracted and subsequently employed in similarity-based scoring and submodular optimization to enable efficient, principled data filtering. Extensive experiments across multiple LLMs (LLaMA-2, Qwen), diverse NLP tasks (NLI, RE, QA), and varying selection ratios (10%–50%) demonstrate consistent superiority over baseline methods: instruction fine-tuning accuracy improves by 2.1–4.7 percentage points on average, data utilization efficiency increases by over 3×, and the approach exhibits enhanced generalization and robustness.

Technology Category

Application Category

📝 Abstract

Task-specific instruction tuning enhances the performance of large language models (LLMs) on specialized tasks, yet efficiently selecting relevant data for this purpose remains a challenge. Inspired by neural coactivation in the human brain, we propose a novel data selection method called NAS, which leverages neuronal activation states as embeddings for samples in the feature space. Extensive experiments show that NAS outperforms classical data selection methods in terms of both effectiveness and robustness across different models, datasets, and selection ratios.

Problem

Research questions and friction points this paper is trying to address.

Efficient data selection for task-specific instruction tuning.

Leveraging neuronal activation states as sample embeddings.

Improving performance and robustness of large language models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses neuronal activation states as embeddings

Enhances data selection for task-specific tuning

Outperforms classical methods in effectiveness

🔎 Similar Papers

No similar papers found.

Authors to Follow