A Datalake for Data-driven Social Science Research

Published: December 2, 2025 | arXiv ID: 2512.02463v1

By: Puneet Arya, Ojas Sahasrabudhe, Adwaiya Srivastav and more

Potential Business Impact:

Helps scientists easily study people's behavior with data.

Business Areas:
Big Data, Data and Analytics

Social science research increasingly demands data-driven insights, yet researchers often face barriers such as lack of technical expertise, inconsistent data formats, and limited access to reliable datasets. In this paper, we present a Datalake infrastructure tailored to the needs of interdisciplinary social science research. Our system supports ingestion and integration of diverse data types, automatic provenance and version tracking, role-based access control, and built-in tools for visualization and analysis. We demonstrate the utility of our Datalake using real-world use cases spanning governance, health, and education. A detailed walkthrough of one such use case -- analyzing the relationship between income, education, and infant mortality -- shows how our platform streamlines the research process while maintaining transparency and reproducibility. We argue that such infrastructure can democratize access to advanced data science practices, especially for NGOs, students, and grassroots organizations. The Datalake continues to evolve with plans to support ML pipelines, mobile access, and citizen data feedback mechanisms.

Country of Origin
🇮🇳 India

Page Count
8 pages

Category
Computer Science:
Databases