Code2Doc: A Quality-First Curated Dataset for Code Documentation
By: Recep Kaan Karaman, Meftun Akarsu
Potential Business Impact:
Makes computer code explanations more accurate.
The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI generated. Starting from 52,069 extracted candidates, only 25.6% satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9% are flagged as potentially AI generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero-shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
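To make the four-stage curation concrete, the Python sketch below shows one plausible shape such a pipeline could take. It is an illustrative assumption, not the released Code2Doc code: the function names, thresholds, and the crude AI-generation check are hypothetical placeholders standing in for the paper's actual completeness, complexity, deduplication, and contamination filters.

import hashlib
import re

def has_complete_doc(doc: str, min_words: int = 10) -> bool:
    """Stage 1 (assumed heuristic): documentation completeness and clarity."""
    return len(doc.split()) >= min_words and not doc.strip().endswith("TODO")

def passes_structure(code: str, min_lines: int = 3, max_lines: int = 120) -> bool:
    """Stage 2 (assumed bounds): structural and complexity filtering by line count."""
    n = len(code.strip().splitlines())
    return min_lines <= n <= max_lines

def dedup_key(code: str) -> str:
    """Stage 3: key for exact-duplicate removal on whitespace-normalized code.
    A real pipeline would add near-duplicate detection, e.g. MinHash or embeddings."""
    normalized = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

def looks_ai_generated(doc: str) -> bool:
    """Stage 4 (placeholder): flag documentation that is likely AI generated."""
    return "as an ai language model" in doc.lower()

def curate(candidates):
    """Run all four stages over (code, doc) pairs, keeping only clean samples."""
    seen, kept = set(), []
    for code, doc in candidates:
        if not has_complete_doc(doc) or not passes_structure(code):
            continue
        key = dedup_key(code)
        if key in seen or looks_ai_generated(doc):
            continue
        seen.add(key)
        kept.append((code, doc))
    return kept

Under this sketch, calling curate(extracted_pairs) on the raw candidates would yield the retained subset, mirroring the reported reduction from 52,069 extracted functions to the 25.6% that satisfy every constraint.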
Similar Papers
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
Software Engineering
Makes computer code easier to understand automatically.
CodeWiki: Automated Repository-Level Documentation at Scale
Software Engineering
Helps programmers understand big code projects easily.
Automated and Context-Aware Code Documentation Leveraging Advanced LLMs
Software Engineering
Writes helpful notes for computer code automatically.