HalleluBERT: Let every token that has meaning bear its weight
By: Raphael Scheible-Schmitt
Potential Business Impact:
Makes computers understand Hebrew text much better.
Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale, extensively trained RoBERTa encoder. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.
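The abstract mentions a Hebrew-specific byte-level BPE vocabulary trained on the pretraining corpus. Below is a minimal sketch of how such a vocabulary could be built with the Hugging Face `tokenizers` library; the file paths and vocabulary size are illustrative assumptions, not values reported by the paper.

```python
# Sketch: training a Hebrew byte-level BPE vocabulary (assumed setup,
# not the authors' exact configuration).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Plain-text files of deduplicated Hebrew web text and Wikipedia
# (hypothetical paths).
tokenizer.train(
    files=["hebrew_web.txt", "hebrew_wikipedia.txt"],
    vocab_size=50_265,  # assumed RoBERTa-style vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt for use with a RoBERTa-style encoder.
tokenizer.save_model("hallelubert-tokenizer")
```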
Similar Papers
NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
Computation and Language
Helps computers understand Hebrew text better.
Scaling HuBERT for African Languages: From Base to Large and XL
Computation and Language
Makes computers understand many African languages better.
EuroBERT: Scaling Multilingual Encoders for European Languages
Computation and Language
Helps computers understand many European languages, as well as math and code.