A Greek Government Decisions Dataset for Public-Sector Analysis and Insight
By: Giorgos Antoniou, Giorgos Filandrianos, Aggelos Vlachos and more
Potential Business Impact:
Lets computers answer questions about government rules.
We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, with high-quality raw text extracted from PDFs and released in Markdown format alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns, and we design a retrieval-augmented generation (RAG) task: we formulate a set of representative questions, create high-quality answers, and evaluate a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents. It also shows how such a RAG pipeline could simulate a chat-based assistant that interactively answers questions about public decisions. Given its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training material for new Language Models (LMs) and fine-tuning material for Large Language Models (LLMs), including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.
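To make the retrieve-then-answer idea concrete, here is a minimal Python sketch of the retrieval half of a RAG setup over decision texts. It is not the authors' baseline: the toy corpus, the document IDs, the TF-IDF scoring, and the function names (`build_index`, `retrieve`, `build_prompt`) are all assumptions chosen for illustration, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the Markdown text of decisions.
# These strings and IDs are invented for illustration, not taken from the dataset.
DECISIONS = {
    "doc-1": "Approval of expenditure for school building maintenance in the municipality.",
    "doc-2": "Appointment of a committee to evaluate tenders for road construction.",
    "doc-3": "Approval of expenditure for the purchase of office supplies.",
}


def tokenize(text):
    """Lowercase and split on word characters (Unicode-aware, so Greek text also works)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF vector per document."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: c * idf[t] for t, c in term_counts.items()}
        for doc_id, term_counts in counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, idf, vectors, k=2):
    """Return up to k document IDs with non-zero similarity, best match first."""
    q_counts = Counter(tokenize(question))
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = sorted(
        ((cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:k] if score > 0.0]


def build_prompt(question, doc_ids, docs):
    """Assemble a grounded prompt; sending it to a model is out of scope here."""
    context = "\n\n".join(f"[{doc_id}] {docs[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the decisions below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


idf, vectors = build_index(DECISIONS)
question = "Which decision concerns road construction tenders?"
hits = retrieve(question, idf, vectors)
prompt = build_prompt(question, hits, DECISIONS)
```

A real system over a corpus of this size would more plausibly use a dense or hybrid retriever and chunk long decisions before indexing. The lexical scorer above only keeps the example self-contained.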