The Geometries of Truth Are Orthogonal Across Tasks
By: Waiss Azizian, Michael Kirchhof, Eugene Ndiaye, and more
Potential Business Impact:
Makes AI answers more trustworthy by checking the model's thinking.
Large Language Models (LLMs) have demonstrated impressive generalization capabilities across various tasks, but their claim to practical relevance is still mired in concerns about their reliability. Recent works have proposed examining the activations produced by an LLM at inference time to assess whether its answer to a question is correct. Some works claim that a "geometry of truth" can be learned from examples, in the sense that the activations that generate correct answers can be distinguished from those leading to mistakes with a linear classifier. In this work, we underline a limitation of these approaches: we observe that these "geometries of truth" are intrinsically task-dependent and fail to transfer across tasks. More precisely, we show that linear classifiers trained on distinct tasks share little similarity and, when trained with sparsity-enforcing regularizers, have almost disjoint supports. We show that more sophisticated approaches (e.g., using mixtures of probes and tasks) fail to overcome this limitation, likely because the activation vectors commonly used to classify answers form clearly separated clusters when examined across tasks.
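To make the kind of experiment the abstract describes concrete, here is a minimal sketch, assuming synthetic activations as stand-ins for real LLM hidden states; the task construction, dimensions, and scikit-learn probes are illustrative assumptions, not the authors' code. It trains a sparse (L1-regularized) linear probe per task, checks whether each probe transfers to the other task, and measures the overlap of the probes' supports.

```python
# Illustrative sketch (not the paper's implementation): two synthetic "tasks"
# whose truth directions live on nearly disjoint coordinates, mimicking the
# reported finding of task-dependent probes with almost disjoint sparse supports.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 256   # assumed hidden-state dimension
n = 1000  # examples per task

def make_task(direction):
    """Synthetic task: answer correctness is linearly separable along `direction`."""
    X = rng.normal(size=(n, d))
    y = (X @ direction > 0).astype(int)
    return X, y

# Truth directions supported on different coordinate blocks.
dir_a = np.zeros(d); dir_a[:20] = rng.normal(size=20)
dir_b = np.zeros(d); dir_b[-20:] = rng.normal(size=20)
(Xa, ya), (Xb, yb) = make_task(dir_a), make_task(dir_b)

# One sparse linear probe per task.
probe_a = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xa, ya)
probe_b = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xb, yb)

# In-task accuracy is high; cross-task accuracy falls to roughly chance level.
print("A on A:", probe_a.score(Xa, ya), "| A on B:", probe_a.score(Xb, yb))
print("B on B:", probe_b.score(Xb, yb), "| B on A:", probe_b.score(Xa, ya))

# Support overlap of the sparse probes (Jaccard index of nonzero weights).
sa = set(np.flatnonzero(probe_a.coef_[0]))
sb = set(np.flatnonzero(probe_b.coef_[0]))
print("support overlap:", len(sa & sb) / max(1, len(sa | sb)))
```

On data built this way, the cross-task scores and the near-zero support overlap illustrate the failure mode the paper reports; with real LLM activations, the probes would instead be fit on hidden states collected at inference time for correct and incorrect answers.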
Similar Papers
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Computation and Language
Makes computers tell the truth more often.
Exploring the generalization of LLM truth directions on conversational formats
Computation and Language
Helps computers spot lies, even in long talks.
Do Large Language Models Truly Understand Geometric Structures?
Computation and Language
Teaches computers to understand shapes and space.