Modeling Language as a Sequence of Thoughts
By: Nasim Borazjanizadeh, James McClelland
Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, because they rely primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, a shortcoming that contributes to brittleness in relational direction (e.g., the reversal curse), contextualization errors, and data inefficiency. Cognitive science, in contrast, shows that human comprehension involves converting the incoming linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. Token and sentence representations are produced by the same set of model parameters and trained with a single objective, next-token cross-entropy: because the computation graph of the sentence representations written to memory is retained, gradients from future token losses flow backward through cross-attention to optimize the parameters that generated earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs and other baselines, with scaling fits indicating that GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational-direction generalization in a father-son reversal-curse probe.
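
To make the mechanism in the abstract concrete, the following is a minimal PyTorch sketch of the two-level training loop: tokens are processed one sentence at a time, each sentence is pooled into a "thought" vector appended to a memory that later sentences cross-attend to, and the memory keeps its computation graph so losses on later sentences backpropagate into earlier sentence states. The module names, mean pooling, and all hyperparameters are illustrative assumptions, not the paper's implementation.

# Minimal sketch of sentence-at-a-time language modeling with a shared-parameter
# "thought" memory. Assumed design choices are noted in comments.
import torch
import torch.nn as nn


class SentenceBlock(nn.Module):
    """Causal self-attention over the current sentence plus cross-attention to memory."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, memory):
        # Causal self-attention within the sentence.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + h)
        # Cross-attention to prior sentence ("thought") vectors, if any exist.
        if memory is not None:
            h, _ = self.cross_attn(x, memory, memory)
            x = self.norm2(x + h)
        return self.norm3(x + self.ff(x))


class ThoughtGestaltSketch(nn.Module):
    """The same parameters produce both token logits and the sentence vector."""

    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = SentenceBlock(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, sentences):
        # `sentences` is a list of (batch, seq_len) token tensors, one per sentence.
        loss_fn = nn.CrossEntropyLoss()
        memory, total_loss = None, 0.0
        for sent in sentences:
            h = self.block(self.embed(sent), memory)          # (B, T, D)
            logits = self.lm_head(h[:, :-1])                  # next-token prediction
            total_loss = total_loss + loss_fn(
                logits.reshape(-1, logits.size(-1)), sent[:, 1:].reshape(-1))
            # Pool the sentence into a single "thought" vector (mean pooling is an
            # assumption) and append it WITHOUT detaching, so gradients from future
            # token losses flow back through cross-attention into this sentence.
            thought = h.mean(dim=1, keepdim=True)             # (B, 1, D)
            memory = thought if memory is None else torch.cat([memory, thought], dim=1)
        return total_loss / len(sentences)


if __name__ == "__main__":
    model = ThoughtGestaltSketch()
    batch = [torch.randint(0, 1000, (2, 12)) for _ in range(3)]  # 3 toy sentences
    loss = model(batch)
    loss.backward()  # gradients reach the parameters that built earlier thought vectors

The single next-token cross-entropy objective is what distinguishes this setup from memory models trained with auxiliary losses: the only reason the sentence vectors become useful is that retaining their graph lets future-token gradients shape them.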