Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

📅 2025-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the core challenges of poor generalization and weak interpretability in AI-generated text detection (ATD), this work pioneers the integration of sparse autoencoders (SAEs) into ATD. We apply SAEs to the residual stream of Gemma-2-2b to disentangle highly discriminative, cross-model-stable implicit features. We propose a multidimensional attribution framework that jointly leverages domain- and model-specific statistical analysis, feature-guided intervention, and LLM-assisted semantic interpretation—enabling identification of multiple highly sparse, strongly activated, and semantically interpretable features indicative of AI-generated text. Empirical analysis reveals systematic stylistic biases in modern LLMs under information-dense conditions. Our approach significantly improves ATD generalization to unseen texts and novel LLMs while enhancing decision transparency and interpretability.
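The summary above hinges on a standard sparse-autoencoder objective: reconstruct residual-stream activations through an overcomplete ReLU bottleneck while an L1 penalty keeps the code sparse. Below is a minimal NumPy sketch of that forward pass and loss. The dimensions are toy values (a real run would use the model's width, 2304 for Gemma-2-2b, and a much larger feature expansion), and `sae_forward` with random weights is purely illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One SAE pass: ReLU-encode activations into an overcomplete
    code, then linearly decode back to the input dimension. The L1
    penalty on the code (below) is what encourages sparsity."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # sparse feature activations
    x_hat = f @ W_dec + b_dec               # reconstruction
    return x_hat, f

# Toy sizes for illustration only.
d_model, d_feat = 128, 1024
W_enc = rng.normal(0.0, 0.02, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0.0, 0.02, (d_feat, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))           # stand-in residual-stream rows
x_hat, f = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * np.abs(f).mean()
```

With trained weights, the rows of `f` are the sparse, per-token feature activations whose statistics the paper compares across human and LLM text.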

📝 Abstract
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we improve ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like output with personalized prompts.
Problem

Research questions and friction points this paper is trying to address.

Enhance interpretability in Artificial Text Detection using Sparse Autoencoders.
Identify interpretable features from Gemma-2-2b residual stream for ATD.
Analyze differences between LLM-generated and human-written texts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders extract interpretable features
Analyze semantics using domain-specific statistics
Identify distinct writing styles of LLMs
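The steering approach mentioned above amounts to a feature-guided intervention: nudge the residual stream along one SAE feature's decoder direction and observe how generations change. A minimal sketch under the same toy dimensions as an SAE's decoder; `steer`, the random `W_dec`, and the chosen `feature_idx` are hypothetical illustrations, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_feat = 128, 1024                 # toy sizes for illustration
W_dec = rng.normal(0.0, 0.02, (d_feat, d_model))

def steer(resid, feature_idx, alpha):
    """Shift every residual-stream vector along one SAE feature's
    unit-normalized decoder direction, scaled by alpha."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return resid + alpha * direction

resid = rng.normal(size=(4, d_model))
steered = steer(resid, feature_idx=7, alpha=5.0)
```

Positive `alpha` amplifies the feature's effect on downstream text; negative `alpha` suppresses it, which is how causal relevance of a candidate "AI-style" feature can be probed.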
🔎 Similar Papers
2024-06-21 · Journal of Artificial Intelligence Research · Citations: 6