🤖 AI Summary
Large language models are not robust to out-of-distribution inputs, and while machine-generated optimized prompts effectively steer model outputs, their compositional principles and internal processing mechanisms remain poorly understood. This work systematically investigates the structure of optimized prompts and how models interpret them, combining three complementary approaches: token frequency statistics, neural activation analysis, and cross-model representation trajectory tracking. We report three key findings: first, optimized prompts consist primarily of punctuation marks and nouns that are rare in the training data; second, a sparse, generalizable subset of neural activations robustly discriminates optimized prompts from natural language across models and tasks; third, optimized prompts follow a shared representation evolution path across diverse families of instruction-tuned models. These findings establish an interpretable, transferable mechanistic foundation for enhancing controllability and robustness in large language models.
📝 Abstract
Modern language models (LMs) are not robust to out-of-distribution inputs. Machine-generated ("optimized") prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts consist primarily of punctuation and noun tokens that are rare in the training data. Internally, optimized prompts are clearly distinguishable from natural-language counterparts based on sparse subsets of the model's activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.