🤖 AI Summary
This work addresses the challenge of reliably identifying and attributing text generated by large language models without compromising its quality. The authors propose a lossless fingerprinting method that injects a random sparse vector into the residual stream during generation, encoding a detectable fingerprint in the model's activation space without interfering with semantics. The approach demonstrates, for the first time, that large language models can achieve high-precision self-identification and support multi-model attribution. Experiments show attribution accuracy exceeding 98% across diverse detection settings, with no statistically significant degradation in the fluency or coherence of the generated text.
📝 Abstract
Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions are: (i) establishing reliable self-recognition capabilities in LLMs, (ii) introducing a simple steering mechanism that enables multi-LLM identification with no quality degradation, and (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
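The steering idea can be illustrated with a toy sketch in pure Python. This is not the authors' implementation: the function names (`make_fingerprint`, `steer`, `attribute`), the dimensions, the steering strength, and the projection-based scoring rule are all assumptions made for illustration. The sketch also skips text generation and re-encoding entirely. In the paper the signal is recovered from the activations of a detector LLM reading the generated text, whereas here the fingerprint is added to synthetic Gaussian "activations" and read back from those same vectors.

```python
import random


def make_fingerprint(dim, num_nonzero, seed):
    """Random sparse unit-norm vector: num_nonzero coordinates are +/-1, the rest 0."""
    rng = random.Random(seed)
    vec = [0.0] * dim
    for idx in rng.sample(range(dim), num_nonzero):
        vec[idx] = rng.choice((-1.0, 1.0))
    norm = num_nonzero ** 0.5
    return [x / norm for x in vec]


def steer(hidden_states, fingerprint, strength):
    """Return a copy of the per-token vectors with strength * fingerprint added."""
    return [
        [h + strength * f for h, f in zip(token, fingerprint)]
        for token in hidden_states
    ]


def attribute(hidden_states, fingerprints):
    """Score each candidate by projecting the mean token vector onto its fingerprint."""
    num_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    mean = [sum(tok[i] for tok in hidden_states) / num_tokens for i in range(dim)]
    scores = {
        name: sum(m * f for m, f in zip(mean, fp))
        for name, fp in fingerprints.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical setup: three models, each assigned its own sparse fingerprint.
DIM, TOKENS, NONZERO = 256, 32, 16
fingerprints = {
    name: make_fingerprint(DIM, NONZERO, seed)
    for seed, name in enumerate(["model_a", "model_b", "model_c"])
}

# Synthetic stand-in for residual-stream activations (assumption: unit Gaussian noise).
noise_rng = random.Random(0)
clean = [[noise_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TOKENS)]

steered = steer(clean, fingerprints["model_b"], strength=4.0)
predicted, scores = attribute(steered, fingerprints)
print(predicted, {name: round(s, 2) for name, s in scores.items()})
```

Because random sparse vectors in a high-dimensional space overlap on few coordinates, the fingerprints are nearly orthogonal, so the projection onto the injected fingerprint stands out from the others. Whether such a signal survives the round trip through generated text is the empirical question the paper addresses; this toy does not model it.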