LittiChoQA: Literary Texts in Indic Languages Chosen for Question Answering
By: Aarya Khandelwal, Ritwik Mishra, Rajiv Ratn Shah
Potential Business Impact:
Helps computers understand stories in Indian languages.
Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs, with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening substantially improves throughput. Among the evaluated models, Krutrim-2 achieves the strongest performance, obtaining a semantic score of 76.1 with full context; in shortened-context settings it scores 74.9 with answer paragraph selection and 71.4 with vector-based retrieval. Qualitative evaluations further corroborate these findings.
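The abstract contrasts full-context QA with a vector-based retrieval setting that shortens the context before the model answers. The paper's actual retrieval pipeline is not described here; the sketch below only illustrates the general idea with a toy bag-of-words embedding and cosine similarity, keeping the top-k paragraphs most similar to the question (the `embed`, `cosine`, and `shorten_context` names are ours, not the authors').

```python
# Hedged sketch of vector-based context shortening for long-context QA.
# A real system would use multilingual sentence embeddings; here a toy
# bag-of-words vector stands in so the example stays self-contained.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": lowercased word-count vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def shorten_context(paragraphs: list[str], question: str, k: int = 1) -> list[str]:
    # Keep only the k paragraphs most similar to the question,
    # shrinking the context fed to the QA model.
    q = embed(question)
    ranked = sorted(paragraphs, key=lambda p: cosine(embed(p), q), reverse=True)
    return ranked[:k]

story = [
    "The farmer lived near the river with his two sons.",
    "One evening the younger son found a lamp buried in the field.",
    "The village fair was held every spring after the harvest.",
]
print(shorten_context(story, "What did the younger son find in the field?", k=1))
```

Shortening the context this way trades some answer quality for throughput, which matches the trade-off the abstract reports between the full-context and retrieval settings.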
Similar Papers
Long-context Non-factoid Question Answering in Indic Languages
Computation and Language
Helps computers answer questions from long texts.
LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
Computation and Language
Helps computers understand stories better.
PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
Computation and Language
Helps AI answer school questions accurately.