PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection
By: Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, and more
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases -- such as music requests where the song title and the user's language differ. Open-source tools such as LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact, well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it well suited to compute- and latency-constrained environments.
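The two-level objective described in the abstract can be pictured with a short, hypothetical PyTorch sketch: a supervised-contrastive instance-level term that separates utterances by language, plus a class-level term that aligns each embedding with a learned centroid for its language under an adaptive margin. The function name two_level_contrastive_loss, the centroid parameterisation, and the specific margin rule below are illustrative assumptions, not the paper's exact loss.

# Hypothetical sketch of a two-level contrastive objective: instance-level
# separation plus class-level alignment with an adaptive margin. Details
# (margin rule, centroids, weighting) are assumptions for illustration.
import torch
import torch.nn.functional as F

def two_level_contrastive_loss(embeddings, labels, class_centroids,
                               temperature=0.1, base_margin=0.2):
    """embeddings: (N, D) utterance embeddings
    labels: (N,) integer language ids (torch.long)
    class_centroids: (C, D) learnable per-language centroids
    """
    z = F.normalize(embeddings, dim=-1)

    # Instance level: supervised contrastive separation within the batch.
    sim = z @ z.T / temperature                      # (N, N) scaled cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))  # ignore self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~mask_self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    inst_loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Class level: pull each embedding toward its own language centroid and
    # push it away from the hardest confusable centroid by an adaptive margin.
    c = F.normalize(class_centroids, dim=-1)
    cos_to_centroids = z @ c.T                        # (N, C)
    own = cos_to_centroids.gather(1, labels[:, None]) # similarity to own class
    hardest_other = cos_to_centroids.scatter(
        1, labels[:, None], float("-inf")).max(1, keepdim=True).values
    # Assumed margin rule: demand a larger gap when the closest other
    # language (e.g. a closely related one) is already similar.
    margin = base_margin * (1.0 + hardest_other.clamp(min=0.0))
    class_loss = F.relu(margin - (own - hardest_other)).squeeze(1)

    return (inst_loss + class_loss).mean()

In this sketch the margin simply grows with the similarity of the hardest confusable language, which is one plausible way to make the margin "adaptive"; the paper's actual margin schedule and loss weighting would need to follow the authors' formulation.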
Similar Papers
Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Computation and Language
Helps computers spot bad online talk in many languages.
Lingua Custodi's participation at the WMT 2025 Terminology shared task
Computation and Language
Lets computers understand sentences in many languages.
PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models
Computation and Language
AI spots fake news in many languages.