Multi-component Causal Tracing in Large Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of systematically identifying the joint causal influence of multiple internal components, such as attention heads and MLP neurons, on a target performance metric such as accuracy or fairness in large language models. The authors propose a unified framework that recasts the discrete component-selection problem as a continuous optimization task via soft interventions. By combining causal tracing with a metric transformation, the framework searches efficiently under constraints and outputs a binary selection of components. Experiments show that the method outperforms existing baselines at identifying subsets of components with a high impact on the target metric.

📝 Abstract

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest that quantify the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. The framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy or fairness). It does so by supporting flexible interventions that can be applied to a wide range of target metrics. To address the combinatorial complexity of the multi-component problem, the authors design an efficient algorithm that leverages soft interventions and a carefully designed metric transformation. The algorithm converts the combinatorial search into a continuous problem that can be solved efficiently under suitable constraints, and its solution yields binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.
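The idea of relaxing a binary component choice into a continuous one can be illustrated with a minimal sketch. It is not the authors' algorithm. The toy linear "model" and every name in it (`W`, `H_CLEAN`, `H_BASE`, `select_components`) are hypothetical. The sigmoid parametrization, the sparsity penalty `lam`, and the top-k rounding are assumed stand-ins for the constraints and metric transformation that the abstract mentions but does not specify.

```python
import numpy as np

# Hypothetical toy setup: six "components" feeding a linear readout.
W = np.array([2.0, -0.1, 1.5, 0.05, 0.3, -0.2])  # contribution of each component
H_CLEAN = np.ones(6)   # component activations on the clean run
H_BASE = np.zeros(6)   # baseline (ablated) activations

def metric(h):
    """Toy target metric: a linear readout of the component activations."""
    return float(W @ h)

def soft_intervene(mask):
    """Blend each component between clean (mask=0) and baseline (mask=1)."""
    return (1.0 - mask) * H_CLEAN + mask * H_BASE

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def objective(theta, lam):
    """Metric drop caused by the soft intervention, minus a sparsity penalty."""
    mask = sigmoid(theta)
    drop = metric(H_CLEAN) - metric(soft_intervene(mask))
    return drop - lam * mask.sum()

def numerical_grad(fn, x, eps=1e-5):
    """Central finite differences, so the metric can stay a black box."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return grad

def select_components(k, lam=0.5, lr=0.5, steps=200):
    """Gradient ascent on the relaxed mask, then round to the k largest entries."""
    theta = np.zeros(W.size)  # mask starts at 0.5 for every component
    for _ in range(steps):
        theta += lr * numerical_grad(lambda t: objective(t, lam), theta)
    mask = sigmoid(theta)
    top_k = np.argsort(-mask)[:k]
    return sorted(int(i) for i in top_k), mask

selected, soft_mask = select_components(k=2)
print(selected)
```

On this toy setup the call selects components 0 and 2, the two with the largest weights. Because the toy metric is additive, ranking components one at a time would give the same answer. The relaxation is only worth its cost when components interact, which is the multi-component setting the paper targets.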

Problem

Research questions and friction points this paper is trying to address.

causal tracing

large language models

multi-component analysis

model interpretability

attention heads

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal tracing

multi-component intervention

soft intervention