PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection
By: Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, and more
Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases -- such as music requests where the song title and the user's language differ. Open-source tools such as LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact, well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it well suited to compute- and latency-constrained environments.
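The two-level objective described in the abstract can be pictured with a short, hypothetical PyTorch sketch: a supervised-contrastive instance-level term that separates utterances by language, plus a class-level term that aligns each embedding with a learned centroid for its language under an adaptive margin. The function name two_level_contrastive_loss, the centroid parameterisation, and the specific margin rule below are illustrative assumptions, not the paper's exact loss.

# Hypothetical sketch of a two-level contrastive objective: instance-level
# separation plus class-level alignment with an adaptive margin. Details
# (margin rule, centroids, weighting) are assumptions for illustration.
import torch
import torch.nn.functional as F

def two_level_contrastive_loss(embeddings, labels, class_centroids,
                               temperature=0.1, base_margin=0.2):
    """embeddings: (N, D) utterance embeddings
    labels: (N,) integer language ids (torch.long)
    class_centroids: (C, D) learnable per-language centroids
    """
    z = F.normalize(embeddings, dim=-1)

    # Instance level: supervised contrastive separation within the batch.
    sim = z @ z.T / temperature                      # (N, N) scaled cosine similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask_self, float("-inf"))  # ignore self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~mask_self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    inst_loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Class level: pull each embedding toward its own language centroid and
    # push it away from the hardest confusable centroid by an adaptive margin.
    c = F.normalize(class_centroids, dim=-1)
    cos_to_centroids = z @ c.T                        # (N, C)
    own = cos_to_centroids.gather(1, labels[:, None]) # similarity to own class
    hardest_other = cos_to_centroids.scatter(
        1, labels[:, None], float("-inf")).max(1, keepdim=True).values
    # Assumed margin rule: demand a larger gap when the closest other
    # language (e.g. a closely related one) is already similar.
    margin = base_margin * (1.0 + hardest_other.clamp(min=0.0))
    class_loss = F.relu(margin - (own - hardest_other)).squeeze(1)

    return (inst_loss + class_loss).mean()

In this sketch the margin simply grows with the similarity of the hardest confusable language, which is one plausible way to make the margin "adaptive"; the paper's actual margin schedule and loss weighting would need to follow the authors' formulation.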
Similar Papers
Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Computation and Language
Helps computers spot bad online talk in many languages.
Lingua Custodi's participation at the WMT 2025 Terminology shared task
Computation and Language
Lets computers understand sentences in many languages.
PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models
Computation and Language
AI spots fake news in many languages.