Code-Mix Sentiment Analysis on Hinglish Tweets
By: Aashi Garg, Aneshya Das, Arshi Arya and more
Potential Business Impact:
Helps companies understand what people say online.
The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish, a hybrid of Hindi and English that is widely used in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
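To illustrate why subword tokenization helps with Romanized spelling variation, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece, the scheme mBERT uses. The function name `wordpiece_tokenize` and the tiny vocabulary `TOY_VOCAB` are hypothetical and chosen only for illustration; they are not the authors' code or mBERT's real vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup.

    Pieces that continue a word carry a "##" prefix, as in WordPiece.
    If any part of the word cannot be matched, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not mBERT's actual vocabulary).
TOY_VOCAB = {"bah", "##ut", "##ot", "##oo", "##t"}

# Three Romanized spellings of the Hindi word for "very" share the piece "bah".
for spelling in ["bahut", "bahot", "bahoot"]:
    print(spelling, wordpiece_tokenize(spelling, TOY_VOCAB))
```

Under this toy vocabulary, all three spellings begin with the same piece, so a model can relate them even when a particular spelling never appeared as a whole word in training. In practice the vocabulary would come from the pretrained tokenizer rather than being written by hand.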
Similar Papers
HiFACTMix: A Code-Mixed Benchmark and Graph-Aware Model for Evidence-Based Political Claim Verification in Hinglish
Computation and Language
Checks if political claims in mixed languages are true.
Comparative Study of Pre-Trained BERT and Large Language Models for Code-Mixed Named Entity Recognition
Computation and Language
Helps computers understand mixed Hindi-English text.
Sample-Efficient Language Model for Hinglish Conversational AI
Computation and Language
Teaches computers to chat in Hindi and English.