🤖 AI Summary
While model compression techniques (e.g., pruning, quantization) improve the inference efficiency of large language models (LLMs), their impact on interpretability—particularly via sparse autoencoders (SAEs)—remains poorly understood.
Method: We systematically investigate the cross-compression-state transferability of SAEs trained on original LLMs to their pruned or quantized counterparts. We further propose structured pruning of the SAE itself—without retraining—to reduce its computational footprint while preserving explanatory capability.
Results: We find that SAEs trained on uncompressed models transfer effectively to compressed variants, with only marginal degradation in explanation quality. Moreover, pruning the SAE directly achieves comparable interpretability to SAEs trained from scratch on compressed models—yet at drastically lower training cost. This work is the first to empirically validate strong SAE transferability across LLM compression states and establishes a novel, cost-efficient paradigm for maintaining interpretability in deployment-grade LLMs, thereby enabling practical, trustworthy model analysis.
📝 Abstract
Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains unclear. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective at decomposing a model's activation space into its feature basis. In this work, we explore how SAEs trained on the original and compressed models differ. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained directly on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
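The two ideas above (reusing an original-model SAE on a compressed model's activations, and structurally pruning the SAE itself without retraining) can be illustrated with a minimal sketch. This is a toy NumPy mock-up under stated assumptions, not the paper's implementation: random weights stand in for a trained SAE, synthetic Gaussian vectors stand in for residual-stream activations, and coarse rounding stands in for quantization; the pruning criterion (keep the most frequently firing latents) is one plausible choice of structured pruning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 16, 64, 1000  # activation dim, SAE width, number of samples

# Stand-in for activations collected from the ORIGINAL model's forward pass
# (assumption: real work would hook a residual stream of the LLM).
acts = rng.normal(size=(n, d))

# Toy "trained" SAE: a ReLU encoder and linear decoder with random weights
# standing in for learned dictionary features.
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))

def sae_forward(x, W_enc, b_enc, W_dec):
    """Encode to a sparse latent code, then reconstruct."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU sparsifies the code
    return z, z @ W_dec

# Simulate the COMPRESSED model's activations via coarse rounding
# (a crude proxy for quantization-induced activation shift).
acts_q = np.round(acts * 4) / 4

# Transfer: apply the original-model SAE directly to compressed activations.
z_q, recon_q = sae_forward(acts_q, W_enc, b_enc, W_dec)

# Structured pruning of the SAE itself, no retraining: rank latent units
# by firing rate on the compressed activations and keep the top half.
fire_rate = (z_q > 0).mean(axis=0)
keep = np.argsort(fire_rate)[-k // 2:]
z_p, recon_p = sae_forward(acts_q, W_enc[:, keep], b_enc[keep], W_dec[keep])

err_full = np.mean((acts_q - recon_q) ** 2)
err_pruned = np.mean((acts_q - recon_p) ** 2)
print(f"reconstruction MSE -- full SAE: {err_full:.4f}, pruned SAE: {err_pruned:.4f}")
```

In the paper's setting the comparison would be between this cheaply pruned SAE and an SAE trained from scratch on the compressed model; the claim is that their explanation quality is comparable while the pruning route avoids a full training run.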