Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
By: Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, and more
Potential Business Impact:
Predicts text style from AI's thoughts.
Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM internal structures and by the infeasibility of having all their outputs human-evaluated. In this paper, we present a first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from the model's activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLM activations with shallow learning models.
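The approach described in the abstract, fitting a shallow scikit-learn classifier on per-prompt activation vectors and scoring it with F1, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: real activations would come from Mistral-7B hidden states, and here synthetic clustered vectors stand in for them; the dimensions, classifier choice, and genre count are assumptions.

```python
# Hedged sketch: probing "activations" for text genre with a shallow
# scikit-learn classifier. Synthetic vectors stand in for real LLM
# hidden states (which would require running e.g. Mistral-7B).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, hidden_dim, n_genres = 600, 64, 3  # assumed sizes

# Stand-in activation vectors: one row per prompt, clustered by genre.
genres = rng.integers(0, n_genres, size=n_samples)
centers = rng.normal(size=(n_genres, hidden_dim))
X = centers[genres] + 0.5 * rng.normal(size=(n_samples, hidden_dim))

X_train, X_test, y_train, y_test = train_test_split(
    X, genres, test_size=0.25, random_state=0, stratify=genres)

# A shallow probe: logistic regression on the activation features.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"macro F1: {macro_f1:.2f}")
```

On well-separated synthetic clusters like these the probe scores near-perfect F1; the paper's 98% vs. 71% gap between datasets suggests real activation geometry varies considerably by corpus.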
Similar Papers
Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations
Computation and Language
Helps computers grade essays fairly, even with different questions.
LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Computation and Language
Helps computers understand poetry and stories better.
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Computation and Language
Lets computers explain their own thinking.