Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
By: Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, and more
Potential Business Impact:
Finds fake writing from smart computer programs.
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
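To make the SAE step concrete, here is a minimal sketch of how a sparse autoencoder turns a residual-stream activation into a sparse feature vector. All dimensions, weights, and the ReLU encoder form are illustrative assumptions, not the paper's actual trained SAE for Gemma-2-2b.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16   # hypothetical residual-stream width (Gemma-2-2b's is larger)
d_sae = 64     # overcomplete SAE dictionary size (assumption)

# Randomly initialized stand-ins for trained SAE parameters.
W_enc = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.full(d_sae, -0.05)   # slightly negative bias encourages sparsity
W_dec = rng.normal(scale=0.1, size=(d_model, d_sae))
b_dec = np.zeros(d_model)

def sae_features(h):
    """Encode a residual-stream vector h into sparse, non-negative features."""
    return np.maximum(0.0, W_enc @ h + b_enc)   # ReLU zeroes most units

def sae_reconstruct(f):
    """Decode sparse features back to an approximate activation."""
    return W_dec @ f + b_dec

h = rng.normal(size=d_model)    # stand-in for one token's activation
f = sae_features(h)
h_hat = sae_reconstruct(h=f) if False else sae_reconstruct(f)

# Features that fire systematically for machine-generated but not
# human-written text are the kind of candidates one would then examine
# via domain statistics, steering, or manual/LLM-based interpretation.
print(f.shape, h_hat.shape, float((f > 0).mean()))
```

In this framing, each SAE feature is a dictionary direction in activation space; the analysis in the paper amounts to asking which of these directions separate LLM text from human text and whether they have a readable meaning.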
Similar Papers
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
Artificial Intelligence
Finds hidden ideas in text data.
Sparse Autoencoders are Topic Models
CV and Pattern Recognition
Finds hidden themes in pictures and words.
Enabling Precise Topic Alignment in Large Language Models Via Sparse Autoencoders
Computation and Language
Makes AI talk about any topic you want.