Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
By: Baturay Saglam , Paul Kassianik , Blaine Nelson and more
Potential Business Impact:
Makes AI safer by finding bad ideas inside.
Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior$\unicode{x2013}$even when surface content remains unchanged. These findings support geometry-aware tools that operate directly in latent space to detect and mitigate harmful or adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states to act as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.
Similar Papers
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
Computation and Language
Teaches computers to understand and change feelings.
Visualizing LLM Latent Space Geometry Through Dimensionality Reduction
Machine Learning (CS)
Shows how computer language brains think and learn.
Linear Spatial World Models Emerge in Large Language Models
Artificial Intelligence
Computers learn how objects are arranged in space.