Causal Language Control in Multilingual Transformers via Sparse Feature Steering

2025-07-17
Citations: 0
Influential: 0

AI Summary
This work addresses deterministic control of the output language of large language models (LLMs) in zero-shot settings, that is, without explicit language prompts or fine-tuning. It proposes a feature-level intervention based on pretrained sparse autoencoders (SAEs): the authors identify sparse features in the residual streams of Gemma-2B and Gemma-9B whose activations differ most between English and each target language (Chinese, Japanese, Spanish, and French), then modify a single such feature at one transformer layer during inference. Steering is most effective in mid-to-late layers and is amplified by specific attention heads that are disproportionately associated with language-sensitive SAE features. Language-switching success is measured with FastText language identification, and semantic fidelity with LaBSE sentence-embedding similarity. Intervening on a single SAE feature achieves up to 90% language-switching success while preserving semantic fidelity. The results point to sparse feature steering as a lightweight, interpretable mechanism for controllable multilingual generation.

Abstract
Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
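The feature-identification step in the abstract can be illustrated with a small sketch. The abstract does not say which statistic is used to decide that activations "differ most significantly", so the version below assumes a plain difference of mean activations (target language minus English). The function name `rank_language_features` and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def rank_language_features(english_acts, target_acts, top_k=1):
    """Rank SAE features by mean activation gap (target minus English).

    Each argument is a list of per-token activation vectors, with one float
    per SAE feature. Returns (indices of the top_k features, all gaps).
    The mean-difference criterion is an assumption, not the paper's method.
    """
    n_features = len(english_acts[0])
    gaps = []
    for j in range(n_features):
        mean_en = fmean(row[j] for row in english_acts)
        mean_tgt = fmean(row[j] for row in target_acts)
        gaps.append(mean_tgt - mean_en)
    # Largest positive gap first: features that fire more on the target language.
    order = sorted(range(n_features), key=lambda j: gaps[j], reverse=True)
    return order[:top_k], gaps


# Toy activations: 3 tokens x 4 SAE features (made-up numbers).
english = [[0.0, 1.0, 0.2, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 1.1, 0.1, 0.0]]
french = [[0.0, 1.0, 0.1, 2.0], [0.0, 0.8, 0.2, 2.5], [0.1, 1.0, 0.0, 1.5]]

top, gaps = rank_language_features(english, french)
print(top)  # index of the feature most specific to the target language
```

In this toy data only the last feature is active on French tokens and silent on English ones, so it is the one a single-feature intervention would target.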
Problem

Research questions and friction points this paper is trying to address.

Control language generation in multilingual models without fine-tuning
Identify sparse autoencoder features for language-specific steering
Achieve high language-switching success while preserving semantic fidelity
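The two evaluation quantities named above can be sketched as follows. This assumes language labels have already been predicted by a classifier such as FastText and that sentence embeddings have already been computed by a model such as LaBSE. The helper names and the toy inputs are hypothetical, and no real FastText or LaBSE call is made here.

```python
import math


def switch_success_rate(predicted_langs, target_lang):
    """Fraction of generations whose predicted language equals the target."""
    hits = sum(1 for lang in predicted_langs if lang == target_lang)
    return hits / len(predicted_langs)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Stand-ins for classifier output on four steered generations.
predicted = ["fr", "fr", "en", "fr"]
rate = switch_success_rate(predicted, "fr")

# Stand-ins for embeddings of an unsteered and a steered generation.
sim = cosine_similarity([3.0, 4.0], [3.0, 4.0])
print(rate, sim)
```

A high success rate alone is not enough: a steered output that switches language but drifts in meaning would show up as a low embedding similarity, which is why the two numbers are reported together.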
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses sparse autoencoder features for language control
Modifies a single SAE feature at one transformer layer to shift the output language
Finds that steering is most effective in mid-to-late transformer layers
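One common way to act on a single SAE feature is to add a scaled copy of that feature's decoder direction to the residual stream at the chosen layer. The summary does not state exactly how the feature is modified, so the sketch below is an assumption about the general technique rather than the paper's procedure. The names `steer_residual`, `direction`, and `strength` are hypothetical.

```python
def steer_residual(residual, decoder_direction, strength):
    """Return residual + strength * decoder_direction for one token position.

    residual: residual-stream vector at the chosen layer (list of floats).
    decoder_direction: the SAE decoder vector of the selected feature.
    strength: steering coefficient; its scale is a free parameter here.
    """
    if len(residual) != len(decoder_direction):
        raise ValueError("residual and decoder direction must match in size")
    return [r + strength * d for r, d in zip(residual, decoder_direction)]


# Toy 3-dimensional residual vector and a unit decoder direction.
residual = [0.5, -1.0, 0.25]
direction = [0.0, 1.0, 0.0]

steered = steer_residual(residual, direction, strength=4.0)
print(steered)
```

In a real model this update would be applied inside a forward hook at one layer, and both the layer choice and the coefficient would need to be tuned, consistent with the finding that mid-to-late layers respond best.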