Score: 2

HalleluBERT: Let every token that has meaning bear its weight

Published: October 24, 2025 | arXiv ID: 2510.21372v1

By: Raphael Scheible-Schmitt

Potential Business Impact:

Improves how well software understands Hebrew text, strengthening tasks such as named entity recognition and sentiment analysis for Hebrew-language applications.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale, extensively trained RoBERTa encoder. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. It sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.
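Since the paper describes a RoBERTa-style encoder evaluated on token-level tasks such as NER, a downstream user would typically load it through a standard encoder API and fine-tune a classification head. The sketch below illustrates this with Hugging Face Transformers; the checkpoint name "HalleluBERT-base" and the label count are placeholder assumptions, as the listing does not give a released model identifier.

```python
# Minimal sketch: loading a RoBERTa-style Hebrew encoder for token
# classification (e.g., NER). The checkpoint id below is hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "HalleluBERT-base"  # placeholder; not a confirmed hub id

# Byte-level BPE tokenizer and encoder weights are loaded together.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # assumed BIO tag set size for a typical NER benchmark
)

# Tokenize a Hebrew sentence and predict a tag for each subword token.
text = "שלום עולם"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_tag_ids = logits.argmax(dim=-1)
print(predicted_tag_ids)
```

In practice the classification head would be fine-tuned on a labeled Hebrew NER or sentiment dataset before the predictions are meaningful; the snippet only shows the loading and inference path.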

Repos / Data Links

Page Count
9 pages

Category
Computer Science: Computation and Language