Measuring Intrinsic Dimension of Token Embeddings
By: Takuya Kataiwa, Cho Hakaze, Tetsushi Ohki
Potential Business Impact:
Shows that AI word representations waste space, so models could be made smaller and cheaper.
In this study, we measure the Intrinsic Dimension (ID) of token embeddings to estimate the dimension of the manifolds spanned by the representations, and thereby quantify their redundancy relative to their extrinsic dimensionality. Specifically, (1) we estimate the ID of token embeddings in small-scale language models as well as modern large language models, finding that the embedding spaces often reside on manifolds of much lower dimension than their extrinsic dimensionality; (2) we measure the ID across various model sizes and observe that the redundancy rate increases as the model scale grows; (3) we track the dynamics of the ID during training and find a rapid drop in the ID in the early stages of training. Moreover, (4) when LoRA is applied to the embedding layers, we observe a sudden drop in perplexity around the estimated ID, suggesting that the ID can serve as a useful guideline for applying LoRA.
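The abstract does not specify which ID estimator the authors use, but a common choice for this kind of analysis is the TwoNN method (Facco et al.), which infers the manifold dimension from the ratio of each point's second- to first-nearest-neighbor distance. The sketch below is a minimal, illustrative implementation on synthetic data, not the paper's code: points lying on a 2-D plane embedded in a 10-D ambient space should yield an ID estimate near 2, far below the extrinsic dimension.

```python
import numpy as np

def twonn_id(X: np.ndarray) -> float:
    """Estimate intrinsic dimension with the TwoNN ratio method (Facco et al.)."""
    # Pairwise Euclidean distances; mask self-distances on the diagonal.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    dist.sort(axis=1)
    r1, r2 = dist[:, 0], dist[:, 1]          # 1st- and 2nd-nearest-neighbor distances
    mu = r2 / r1                             # TwoNN distance ratios (>= 1)
    # Maximum-likelihood estimate of the dimension under the TwoNN model.
    return len(mu) / np.log(mu).sum()

# Sanity check: a 2-D manifold embedded in 10 extrinsic dimensions.
rng = np.random.default_rng(0)
X = np.zeros((500, 10))
X[:, :2] = rng.uniform(size=(500, 2))
print(twonn_id(X))  # close to the true intrinsic dimension of 2
```

Applied to a model's token-embedding matrix (vocabulary size x hidden size), the same estimator would give the kind of ID-versus-extrinsic-dimension comparison the abstract describes.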
Similar Papers
Memorization in Language Models through the Lens of Intrinsic Dimension
Computation and Language
Makes AI remember less private stuff it learned.
Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Computation and Language
Makes computers understand writing styles better.
Less is More: Local Intrinsic Dimensions of Contextual Language Models
Computation and Language
Shows how computers learn and improve.