🤖 AI Summary
This work addresses the challenge of visualizing internal activation directions of large language models (LLMs) in discrete text space, where existing prompt optimization methods often converge to suboptimal solutions. To overcome this limitation, the authors propose ADAPT, a method that combines beam search initialization with adaptive gradient-guided mutation, designed specifically for LLM feature visualization. By optimizing input tokens to maximize activation along target directions, ADAPT improves both the activation strength and the semantic interpretability of the generated samples. Experiments on the Gemma 2 2B model show that ADAPT consistently outperforms existing methods across network layers and sparse autoencoder latent types, establishing the feasibility of feature visualization for LLMs in discrete input spaces.
📝 Abstract
Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, where optimization is highly prone to local optima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.
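To make the core optimization concrete, here is a minimal toy sketch of gradient-guided token mutation (HotFlip-style), the ingredient the abstract pairs with beam search initialization. All names and values are hypothetical: a random embedding matrix `E` stands in for the LLM, and the "activation" is the mean token embedding projected onto a target direction `w`, so the gradient with respect to the one-hot token inputs is available in closed form. This is not the authors' implementation, only an illustration of the general technique.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, T = 50, 16, 6          # vocab size, embedding dim, prompt length (toy values)
E = rng.normal(size=(V, D))  # stand-in for the model's token embedding matrix
w = rng.normal(size=D)       # hypothetical target activation direction

def activation(tokens):
    # Toy "model": mean token embedding projected onto the direction w.
    return E[tokens].mean(axis=0) @ w

def gradient_guided_mutation(tokens, steps=20):
    """Greedily mutate tokens: use the gradient of the activation w.r.t. the
    one-hot token inputs to rank candidate swaps, apply the best one, and
    stop once no swap improves the activation (a local optimum)."""
    tokens = tokens.copy()
    # d(activation)/d(one-hot entry for token v at any position) = (E @ w)[v] / T
    grad = E @ w / len(tokens)
    for _ in range(steps):
        cur = activation(tokens)
        # Predicted gain of replacing the token at position i with token v.
        gains = grad[None, :] - grad[tokens][:, None]   # shape (T, V)
        i, v = np.unravel_index(np.argmax(gains), gains.shape)
        cand = tokens.copy()
        cand[i] = v
        if activation(cand) <= cur:
            break  # no first-order improvement left
        tokens = cand
    return tokens
```

In a real LLM the gradient is taken through the network (e.g. via autograd) rather than computed in closed form, and the predicted gains are only a first-order approximation, which is why methods like ADAPT combine such mutations with search over initializations.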