The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text
By: Maged S. Al-Shaibani, Moataz Ahmed
Potential Business Impact:
Finds fake Arabic writing made by computers.
Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text. This poses subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia: it enables sophisticated misinformation campaigns, compromises healthcare guidance, and facilitates targeted propaganda. The challenge is particularly severe in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in the academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns that differentiate human-written from machine-generated Arabic text across these contexts. Despite the human-like quality of their output, we demonstrate that LLMs leave detectable signatures in their Arabic text, with characteristics that vary significantly between domains. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score), with strong precision across model architectures. Our cross-domain analysis confirms the generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains. It establishes a foundation for developing robust, linguistically informed detection systems essential for preserving information integrity in Arabic-language contexts.
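As a rough illustration of what a stylometric comparison can involve, the sketch below computes three generic surface features for a piece of text. The abstract does not list the features the authors measured, so the function name `stylometric_features` and the choice of features (type-token ratio, average word length, average sentence length) are assumptions made for illustration only, not the paper's method. The tokenizer is also a simplification: it assumes undiacritized text, since Arabic diacritics are combining marks that `\w` does not match.

```python
import re


def stylometric_features(text: str) -> dict:
    """Compute a few generic surface features (illustrative, not the paper's feature set)."""
    # Split sentences on Latin and Arabic terminal punctuation (؟ is the Arabic question mark).
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    # \w matches Unicode letters and digits, so Arabic words are captured.
    # Assumption: the text carries no diacritics, which would split words here.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    if n_words == 0:
        return {
            "type_token_ratio": 0.0,
            "avg_word_length": 0.0,
            "avg_sentence_length": 0.0,
        }
    return {
        # Lexical diversity: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n_words,
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Mean number of words per sentence.
        "avg_sentence_length": n_words / max(len(sentences), 1),
    }
```

Computing such features separately for human-written and machine-generated samples, then comparing their distributions, is one simple way to look for the kind of signature the abstract describes. Any real analysis would need a proper Arabic tokenizer and normalization step.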
Similar Papers
Large Language Models and Arabic Content: A Review
Computation and Language
Helps computers understand and use Arabic language better.
The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology
Computation and Language
Helps computers understand and speak Arabic better.
Detecting Stylistic Fingerprints of Large Language Models
Computation and Language
Finds out if computers wrote a text.