CodeWiki: Automated Repository-Level Documentation at Scale
By: Nguyen Hoang Anh, Minh Le-Anh, Bach Le and more
Potential Business Impact:
Helps programmers understand big code projects easily.
Developers spend nearly 58% of their time understanding codebases, yet maintaining comprehensive documentation remains challenging because of codebase complexity and the manual effort involved. While recent Large Language Models (LLMs) show promise for function-level documentation, they fail at the repository level, where capturing architectural patterns and cross-module interactions is essential. We introduce CodeWiki, the first open-source framework for holistic repository-level documentation across seven programming languages. CodeWiki employs three innovations: (i) hierarchical decomposition that preserves architectural context, (ii) recursive agentic processing with dynamic delegation, and (iii) synthesis of textual and visual artifacts, including architecture diagrams and data flows. We also present CodeWikiBench, the first repository-level documentation benchmark with multi-level rubrics and agentic assessment. CodeWiki achieves a quality score of 68.79% with proprietary models and 64.80% with open-source alternatives, outperforming existing closed-source systems and demonstrating scalable, accurate documentation for real-world repositories.
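As a rough illustration of innovations (i) and (ii), the sketch below walks a module tree and delegates to sub-modules whenever a node's code exceeds a size budget, then synthesizes the parent's documentation from the child results. This is a minimal sketch under stated assumptions, not CodeWiki's actual implementation: the `Module` class, the `document` function, the character-count `budget`, and the `stub_summarize` stand-in for an LLM call are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Module:
    """A node in a hypothetical module tree (not CodeWiki's data model)."""
    name: str
    source: str = ""  # code owned directly by this node
    children: List["Module"] = field(default_factory=list)


def collect_source(module: Module) -> str:
    """Concatenate the code of a module and all of its descendants."""
    return module.source + "".join(collect_source(c) for c in module.children)


def document(
    root: Module,
    summarize: Callable[[str, List[str]], str],
    budget: int,
) -> Dict[str, str]:
    """Return {module name: documentation} for every module that was visited.

    A module that fits within `budget` characters (or has no children) is
    documented in a single pass. Otherwise the work is delegated to its
    children, and the parent is synthesized from their documentation.
    """
    docs: Dict[str, str] = {}

    def visit(node: Module) -> str:
        code = collect_source(node)
        if len(code) <= budget or not node.children:
            # Small enough (or indivisible): document directly from the code.
            docs[node.name] = summarize(node.name, [code])
        else:
            # Too large: delegate to each child, then combine the results.
            parts = [visit(child) for child in node.children]
            if node.source:
                parts.append(node.source)
            docs[node.name] = summarize(node.name, parts)
        return docs[node.name]

    visit(root)
    return docs


def stub_summarize(name: str, parts: List[str]) -> str:
    """Placeholder for an LLM call; only reports what it was given."""
    size = sum(len(p) for p in parts)
    return f"{name}: {size} chars from {len(parts)} part(s)"
```

Calling `document(repo, stub_summarize, budget=40)` on a tree whose total code exceeds 40 characters produces one entry per visited module, with the root entry built from its children's summaries rather than from raw code. Swapping `stub_summarize` for a real model call is left open, since the abstract does not specify how the agents are prompted.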
Similar Papers
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
Software Engineering
Makes computer code easier to understand automatically.
SWE-QA: Can Language Models Answer Repository-level Code Questions?
Computation and Language
Helps computers understand large code projects.
DeepCode: Open Agentic Coding
Software Engineering
Turns research papers into working computer code.