Pretraining Finnish ModernBERTs
By: Akseli Reunamo, Laura-Maria Peltonen, Hans Moen, and more
Potential Business Impact:
Makes computers understand Finnish text better.
This paper reports on pretraining ModernBERT encoder models in six sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism that emphasizes languages relevant to Finland. The models are competitive with, or superior to, existing multilingual models, and they outperform monolingual models on tasks requiring contexts longer than 512 tokens. We also present empirical results on using different data in the final stage of training. The code and models are publicly released.
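Because the models are publicly released, they can be loaded with the Hugging Face transformers library like any other ModernBERT-style encoder, which accepts inputs well beyond 512 tokens without truncation. The snippet below is a minimal sketch of masked-token prediction with one of the released checkpoints; the repository id `TurkuNLP/finnish-modernbert-base` is a placeholder assumption, not a confirmed model name from the paper.

```python
# Minimal sketch: masked-token prediction with a released Finnish ModernBERT.
# The model id below is a hypothetical placeholder; substitute the actual
# repository name from the paper's public release.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "TurkuNLP/finnish-modernbert-base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "Helsinki is Finland's [MASK]." in Finnish.
text = f"Helsinki on Suomen {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely token at the masked position.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

For long-document tasks, the same loading code applies; the advantage over older monolingual BERT-style models is simply that sequences longer than 512 tokens can be passed in directly rather than chunked.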
Similar Papers
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Computation and Language
Helps computers understand over 1800 languages.
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
Computation and Language
Makes small computer models search languages better.
llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length
Computation and Language
Helps computers understand longer Japanese sentences better.