Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions
By: Ignacio Sastre, Aiala Rosá
Potential Business Impact:
Teaches computers new ideas without retraining them.
We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional "Austral Tower" to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.
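To make the training setup in the abstract concrete, here is a minimal sketch of learning a single new embedding while everything else stays frozen. It is a toy and not the authors' implementation. The "LLM" is reduced to one frozen output matrix, and the definitions are reduced to the token ids that follow the new token. All names (`train_concept_embedding`, `next_token_probs`, `w_out`, `following_tokens`) and all values are hypothetical. In the actual method the frozen model is a full pretrained transformer, and gradients reach the new embedding by flowing back through it.

```python
import math
import random


def next_token_probs(w_out, emb):
    """Frozen toy 'model': softmax over logits = w_out @ emb."""
    logits = [sum(w * e for w, e in zip(row, emb)) for row in w_out]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def train_concept_embedding(w_out, following_tokens, dim, steps=200, lr=0.5, seed=0):
    """Learn only the new token's embedding with a language-modeling loss.

    w_out            -- frozen output matrix (vocab x dim); never modified here
    following_tokens -- ids of tokens that follow the new token in the
                        definitions (an assumption of this toy setup)
    Returns the learned embedding and the mean loss per step.
    """
    rng = random.Random(seed)
    emb = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # the only trainable parameters
    losses = []
    n = len(following_tokens)
    for _ in range(steps):
        grad = [0.0] * dim
        loss = 0.0
        for target in following_tokens:
            probs = next_token_probs(w_out, emb)
            loss -= math.log(probs[target])
            # Cross-entropy gradient w.r.t. emb: sum_v (p_v - y_v) * w_out[v]
            for v, row in enumerate(w_out):
                coeff = probs[v] - (1.0 if v == target else 0.0)
                for j in range(dim):
                    grad[j] += coeff * row[j]
        # Gradient step on the embedding only; w_out is left untouched.
        emb = [e - lr * g / n for e, g in zip(emb, grad)]
        losses.append(loss / n)
    return emb, losses


if __name__ == "__main__":
    # Hypothetical frozen weights for a 4-token vocabulary and 3-dim embeddings.
    w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, -1.0, -1.0]]
    # Hypothetical data: in the definitions, the new token is mostly followed by token 2.
    emb, losses = train_concept_embedding(w_out, [2, 2, 1, 2], dim=3)
    print("first loss:", losses[0], "last loss:", losses[-1])
    print("next-token distribution after the new token:", next_token_probs(w_out, emb))
```

The point of the sketch is the update rule: the optimizer only ever writes to `emb`, so the result is a single vector that can be attached to a frozen model, which is the sense in which the abstract describes the method as lightweight.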
Similar Papers
Vector Arithmetic in Concept and Token Subspaces
Computation and Language
Makes AI understand word meanings and spelling better.
Revealing emergent human-like conceptual representations from language prediction
Computation and Language
Computers learn ideas like people from just words.
Human-like conceptual representations emerge from language prediction
Computation and Language
Computers learn ideas like people from words.