AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu
By: Muhammad Ammar, Hadiya Murad Hadi, Usman Majeed Butt
Potential Business Impact:
Detects AI-generated writing in Urdu.
Large Language Models (LLMs) can now generate text that closely resembles human writing. This makes them powerful tools for content creation, but it also makes it harder to tell whether a piece of text was written by a human or by a machine. The challenge is more serious for languages like Urdu, where very few tools are available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored to the Urdu language. We developed a balanced dataset comprising 1,800 human-authored and 1,800 AI-generated texts, the latter sourced from models such as Gemini, GPT-4o-mini, and Kimi AI. A detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type-Token Ratio), and N-gram patterns, with significance evaluated through t-tests and Mann-Whitney U tests. Three state-of-the-art multilingual transformer models (mdeberta-v3-base, distilbert-base-multilingual-cased, and xlm-roberta-base) were fine-tuned on this dataset. The mDeBERTa-v3-base model achieved the highest performance, with an F1-score of 91.29 and an accuracy of 91.26% on the test set. This research advances efforts to combat misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low-resource languages.
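The surface features named in the abstract (character and word counts, Type-Token Ratio, N-gram patterns) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `text_features` and `welch_t` are hypothetical, whitespace tokenization is an assumption that only approximates Urdu word boundaries, and the Welch t statistic shown here stands in for whichever t-test variant the paper actually used.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def text_features(text: str, n: int = 2) -> dict:
    """Compute simple surface features for one text (hypothetical helper)."""
    # Assumption: whitespace splitting is a rough proxy for Urdu tokenization.
    tokens = text.split()
    # Count contiguous n-grams of tokens (bigrams by default).
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {
        "char_count": len(text),
        "word_count": len(tokens),
        # Type-Token Ratio: unique tokens divided by total tokens.
        "ttr": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "top_ngrams": ngrams.most_common(3),
    }


def welch_t(a, b) -> float:
    """Welch's t statistic for two samples of a feature (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


if __name__ == "__main__":
    # Toy Urdu sentence, used only to show the call shape.
    sample = "یہ ایک مثال ہے یہ ایک جملہ ہے"
    print(text_features(sample))
```

In practice, one would compute a feature such as `ttr` for every human-authored and every AI-generated text, then pass the two lists to `welch_t` (or to a library routine for the Mann-Whitney U test) to check whether the groups differ.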
Similar Papers
Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings
Computation and Language
Helps find fake news in Urdu.
AI-generated Text Detection: A Multifaceted Approach to Binary and Multiclass Classification
Computation and Language
Finds if writing is from a person or AI.
Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
Computation and Language
Helps computers understand jokes in Urdu.