Transferring Features Across Language Models With Model Stitching

📅 2025-06-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of aligning representation spaces and reusing features across language models of different scales. The authors propose "model stitching": an affine mapping between residual streams that enables efficient transfer of sparse autoencoders (SAEs), interpretability probes, and steering vectors across models of differing sizes. They first show systematically that linear representations are highly similar across model scales. The approach establishes a low-cost, high-fidelity feature-transfer paradigm that lets key interpretability components be reused across models without retraining. Experiments show: (1) SAE training cost reduced by 50%; (2) transferred probes and steering vectors recover ≥95% of ground-truth performance; and (3) semantic, structural, and functional features exhibit quantifiably distinct transfer patterns. This work introduces a scalable framework for model interpretability research grounded in representation-space alignment.

📝 Abstract
In this work, we demonstrate that affine mappings between residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring them to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground-truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently, while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
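The abstract's core primitive is an affine map between residual streams. The paper does not give implementation details here, but a minimal sketch of fitting such a map with ordinary least squares (on synthetic, paired activations standing in for real small/large-model residual streams; all dimensions and data below are hypothetical) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired residual-stream activations for n tokens.
# d_small and d_large are placeholder widths, not values from the paper.
n, d_small, d_large = 2048, 64, 128

X_small = rng.standard_normal((n, d_small))
# Simulate a large model whose stream is (noisily) an affine image of the small one.
A_true = rng.standard_normal((d_small, d_large)) / np.sqrt(d_small)
b_true = rng.standard_normal(d_large)
X_large = X_small @ A_true + b_true + 0.01 * rng.standard_normal((n, d_large))

# Fit the affine map x_large ≈ x_small @ A + b in one least-squares solve
# by appending a bias column of ones to the inputs.
X_aug = np.hstack([X_small, np.ones((n, 1))])        # (n, d_small + 1)
M, *_ = np.linalg.lstsq(X_aug, X_large, rcond=None)  # (d_small + 1, d_large)
A, b = M[:-1], M[-1]

pred = X_small @ A + b
rel_err = np.linalg.norm(pred - X_large) / np.linalg.norm(X_large)
```

In practice the paired activations would come from running both models on the same token stream and reading out a chosen layer's residual stream; the least-squares fit itself is cheap relative to SAE training.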
Problem

Research questions and friction points this paper is trying to address.

Transferring features between language models via affine mappings
Comparing representations across different-sized models using SAEs
Analyzing feature-level transferability of semantic and structural features
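One reason steering vectors transfer cleanly through such a map: a steering vector is a *direction*, so under an affine map x_large ≈ x_small @ A + b the bias cancels and only the linear part acts on it. A small illustration of this identity (with a random placeholder map, not a fitted one):

```python
import numpy as np

rng = np.random.default_rng(1)
d_small, d_large = 64, 128

# Placeholder linear part A and bias b of a small->large affine map.
A = rng.standard_normal((d_small, d_large)) / np.sqrt(d_small)
b = rng.standard_normal(d_large)

# Steering in the small stream, then mapping...
v_small = rng.standard_normal(d_small)
x_small = rng.standard_normal(d_small)
mapped_steered = (x_small + v_small) @ A + b

# ...equals mapping, then steering with the pushed-forward vector v_small @ A.
v_large = v_small @ A
steered_mapped = (x_small @ A + b) + v_large
```

The two orderings agree exactly because the map is affine, which is why a transferred steering vector can act on the large model without any retraining.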
Innovation

Methods, ideas, or system contributions that make the work stand out.

Affine mappings transfer features between models
Transfer SAEs between different-sized models
Small-to-large SAE transfer cuts training costs
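The ideas above can be sketched end to end for the SAE case. The following is one plausible transfer recipe, not necessarily the paper's exact one: decoder directions are pushed forward through the linear part of the fitted map, and the encoder pulls the large stream back via the map's pseudo-inverse. All weights and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
d_small, d_large, m = 64, 128, 256

# Hypothetical fitted small->large affine map (A, b) and a trained small-model SAE.
A = rng.standard_normal((d_small, d_large)) / np.sqrt(d_small)
b = rng.standard_normal(d_large)
W_enc = rng.standard_normal((d_small, m))  # small SAE encoder
b_enc = np.zeros(m)
W_dec = rng.standard_normal((m, d_small))  # small SAE decoder

# Transfer as an initialization for a large-model SAE:
A_pinv = np.linalg.pinv(A)    # (d_large, d_small): maps large stream back to small
W_dec_large = W_dec @ A       # decoder directions pushed forward, (m, d_large)
W_enc_large = A_pinv @ W_enc  # encoder reads large stream via the pulled-back map

def sae_large(x_large):
    """Forward pass of the transferred SAE on a large-model activation."""
    feats = np.maximum((x_large - b) @ W_enc_large + b_enc, 0.0)  # ReLU features
    return feats @ W_dec_large + b                                # reconstruction

# Sanity check: on an activation lying exactly on the affine image of the small
# stream, the transferred encoder recovers the small SAE's features.
x_small = rng.standard_normal(d_small)
x_large = x_small @ A + b
feats_small = np.maximum(x_small @ W_enc + b_enc, 0.0)
feats_large = np.maximum((x_large - b) @ W_enc_large + b_enc, 0.0)
```

Because real activations only approximately satisfy the affine relation, these transferred weights serve as an initialization to be fine-tuned on the large model, which is where the reported ~50% training-cost reduction comes from.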