How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders
By: Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, and more
Potential Business Impact:
Shows how computers pick up languages and ideas as they learn, which could help build better ones.
Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
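To make the method in the abstract concrete, below is a minimal sketch of the kind of sparse autoencoder used in this line of interpretability work: it reconstructs a model's hidden activations through an overcomplete, L1-penalized bottleneck so individual latent features become easier to interpret. The dimensions, the sparsity coefficient, and the random stand-in activations are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sparse autoencoder (SAE) sketch for interpreting hidden activations.
# Assumed, illustrative choices: d_model, d_sae, l1_coeff, and random activations
# standing in for activations collected at one training checkpoint.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)   # activations -> sparse features
        self.decoder = nn.Linear(d_sae, d_model)   # sparse features -> reconstruction

    def forward(self, x: torch.Tensor):
        feats = torch.relu(self.encoder(x))        # non-negative feature activations
        recon = self.decoder(feats)
        return recon, feats


def sae_loss(x, recon, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    return ((recon - x) ** 2).mean() + l1_coeff * feats.abs().mean()


if __name__ == "__main__":
    d_model, d_sae = 512, 4096                     # overcomplete feature dictionary
    sae = SparseAutoencoder(d_model, d_sae)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

    # Stand-in for hidden activations gathered from one model checkpoint.
    activations = torch.randn(8192, d_model)

    for step in range(100):
        batch = activations[torch.randint(0, activations.size(0), (256,))]
        recon, feats = sae(batch)
        loss = sae_loss(batch, recon, feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Training one SAE per checkpoint and comparing the learned features across
    # checkpoints is the cross-stage analysis the abstract describes.
```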
Similar Papers
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Computation and Language
Shows which parts inside a computer handle each specific language.
Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders
Computation and Language
Shows how computers keep track of many languages inside.
Geospatial Mechanistic Interpretability of Large Language Models
Machine Learning (CS)
Shows how computers "see" maps and places.