
IndicRAGSuite: Large-Scale Datasets and a Benchmark for Indian Language RAG Systems

Published: June 2, 2025 | arXiv ID: 2506.01615v2

By: Pasunuti Prasanjith, Prathmesh B More, Anoop Kunchukuttan and more

Potential Business Impact:

Helps computers understand Indian languages better.

Business Areas:

Text Analytics (Data and Analytics, Software)

Retrieval-Augmented Generation (RAG) systems enable language models to access relevant information and generate accurate, well-grounded, and contextually informed responses. However, for Indian languages, the development of high-quality RAG systems is hindered by the lack of two critical resources: (1) evaluation benchmarks for retrieval and generation tasks, and (2) large-scale training datasets for multilingual retrieval. Most existing benchmarks and datasets are centered around English or other high-resource languages, making it difficult to extend RAG capabilities to the diverse linguistic landscape of India. To address the lack of evaluation benchmarks, we create IndicMSMarco, a multilingual benchmark for evaluating retrieval quality and response generation in 13 Indian languages, built by manually translating 1000 diverse queries from the MS MARCO dev set. To address the need for training data, we build a large-scale dataset of (question, answer, relevant passage) tuples derived from the Wikipedias of 19 Indian languages using state-of-the-art LLMs. Additionally, we include translated versions of the original MS MARCO dataset to further enrich the training data and ensure alignment with real-world information-seeking tasks. Resources are available here: https://huggingface.co/collections/ai4bharat/indicragsuite-683e7273cb2337208c8c0fcb

M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

Computation and Language

Helps computers answer questions about pictures in many languages.

5 Dec 2025 3

89%

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Computation and Language

Helps AI understand questions in many languages.

24 Feb 2025 3

88%

DeepRAG: Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch

Computation and Language

Helps computers understand Hindi text better.

11 Mar 2025 0


Page Count

10 pages
