Explaining Fine-Tuned LLMs via Counterfactuals: A Knowledge Graph-Driven Framework

📅 2025-09-25

🤖 AI Summary

Problem: How Low-Rank Adaptation (LoRA) fine-tuning changes the structural reasoning and semantic behavior of large language models (LLMs) remains poorly understood. Method: The paper proposes CFFTLLMExplainer, an explanation framework that combines counterfactual reasoning with knowledge graphs. It constructs BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and learns soft masks over graph nodes and edges to find minimal structural perturbations that induce maximal semantic divergence. Structural sparsity and semantic divergence are optimized jointly, with entropy regularization and edge-smoothness constraints keeping the masks interpretable. Results: Applied to a LoRA-fine-tuned LLaMA-based model, the counterfactual masks expose the structural dependencies the model relies on and align with LoRA-induced parameter shifts. The work offers new insight into the internal mechanisms of fine-tuned LLMs and positions counterfactual graphs as a potential tool for interpretable AI.

📝 Abstract

The widespread adoption of Low-Rank Adaptation (LoRA) has enabled large language models (LLMs) to acquire domain-specific knowledge with remarkable efficiency. However, understanding how this fine-tuning mechanism alters a model's structural reasoning and semantic behavior remains an open challenge. This work introduces a novel framework that explains fine-tuned LLMs via counterfactuals grounded in knowledge graphs. Specifically, we construct BioToolKG, a domain-specific heterogeneous knowledge graph of bioinformatics tools, and design a counterfactual-based explainer for fine-tuned LLMs (CFFTLLMExplainer) that learns soft masks over graph nodes and edges to generate minimal structural perturbations that induce maximum semantic divergence. Our method jointly optimizes structural sparsity and semantic divergence while enforcing interpretability-preserving constraints such as entropy regularization and edge smoothness. We apply this framework to a fine-tuned LLaMA-based LLM and show that counterfactual masking exposes the model's structural dependencies and aligns with LoRA-induced parameter shifts. This work provides new insights into the internal mechanisms of fine-tuned LLMs and highlights counterfactual graphs as a potential tool for interpretable AI.
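
To make the objective described in the abstract more concrete, here is a minimal sketch of a soft-mask counterfactual loss. It is not the authors' implementation: the function name `counterfactual_loss`, the weights `lam_sparse`, `lam_ent`, and `lam_smooth`, the use of cosine distance over mask-weighted node embeddings as a stand-in for semantic divergence, and the particular form of the edge-smoothness term are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_entropy(p, eps=1e-12):
    # Entropy of a Bernoulli(p) variable; low when p is close to 0 or 1.
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def counterfactual_loss(node_logits, edge_logits, edges, node_emb,
                        lam_sparse=1.0, lam_ent=0.1, lam_smooth=0.1):
    """Hypothetical loss: small perturbation, large semantic change.

    node_logits: (N,) mask logits, edge_logits: (E,) mask logits,
    edges: (E, 2) integer node indices, node_emb: (N, D) embeddings.
    """
    node_mask = sigmoid(node_logits)  # soft "keep" probability per node
    edge_mask = sigmoid(edge_logits)  # soft "keep" probability per edge

    # Assumed stand-in for semantic divergence: cosine distance between the
    # plain mean embedding and the mask-weighted mean embedding.
    original = node_emb.mean(axis=0)
    perturbed = (node_mask[:, None] * node_emb).sum(axis=0) / (node_mask.sum() + 1e-12)
    cos = original @ perturbed / (
        np.linalg.norm(original) * np.linalg.norm(perturbed) + 1e-12)
    divergence = 1.0 - cos

    # Perturbation size: average fraction of nodes and edges masked out.
    removed = np.concatenate([1.0 - node_mask, 1.0 - edge_mask]).mean()

    # Entropy regularization pushes masks toward near-binary values.
    entropy = np.concatenate(
        [binary_entropy(node_mask), binary_entropy(edge_mask)]).mean()

    # Assumed smoothness term: an edge mask should agree with its endpoints.
    endpoint_mean = 0.5 * (node_mask[edges[:, 0]] + node_mask[edges[:, 1]])
    smooth = np.mean((edge_mask - endpoint_mean) ** 2)

    loss = (-divergence + lam_sparse * removed
            + lam_ent * entropy + lam_smooth * smooth)
    return {"loss": loss, "divergence": divergence, "removed": removed,
            "entropy": entropy, "smooth": smooth}
```

In a real setting the divergence term would presumably come from comparing the fine-tuned model's outputs on the original and perturbed graphs, and the logits would be trained by gradient descent in an autodiff framework; NumPy is used here only to keep the individual terms easy to read.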

Problem

Research questions and friction points this paper is trying to address.

Explaining how fine-tuning alters LLMs' structural reasoning and semantic behavior

Understanding domain-specific knowledge acquisition through LoRA fine-tuning mechanisms

Developing interpretable AI methods to reveal fine-tuned LLMs' internal mechanisms
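
For readers less familiar with what a LoRA-induced parameter shift looks like, the sketch below uses the standard LoRA formulation, in which a frozen weight matrix W is adapted as W + (alpha / r) · B · A with low-rank factors A and B. The dimensions, the random values, and the helper name `lora_delta` are illustrative assumptions; how the paper itself compares masks with these shifts is not specified in this summary.

```python
import numpy as np

def lora_delta(A, B, alpha):
    """Weight shift contributed by a LoRA adapter: (alpha / r) * B @ A."""
    r = A.shape[0]  # adapter rank
    return (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight (toy values)
A = rng.normal(size=(r, d_in))      # low-rank factor A
# LoRA initializes B to zero; random values here mimic a trained adapter.
B = rng.normal(size=(d_out, r))

delta = lora_delta(A, B, alpha=8.0)
W_adapted = W + delta

# The shift has rank at most r, and far fewer trainable parameters than W.
shift_rank = np.linalg.matrix_rank(delta)
trainable_params = A.size + B.size
full_params = W.size
relative_shift = np.linalg.norm(delta) / np.linalg.norm(W)
```

Because the shift is confined to a rank-r subspace, its magnitude per layer (for example `relative_shift`) gives a compact signal of where fine-tuning changed the model, which is the kind of quantity an explanation could be checked against.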

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual explanations using knowledge graphs

Soft mask learning for minimal structural perturbations

Joint optimization of sparsity and semantic divergence

🔎 Similar Papers

Evaluating the Reliability of Self-Explanations in Large Language Models