LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
By: Jielin Qiu, Zuxin Liu, Zhiwei Liu, and more
Potential Business Impact: Tests whether AI can understand huge computer programs.
The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined into a single LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
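To make the scoring idea concrete, below is a minimal Python sketch of how a composite score like LCBS could aggregate per-metric results grouped into evaluation dimensions. The dimension names, metric names, and equal weighting are illustrative assumptions for this sketch only; the benchmark's actual 17 metrics, 4 dimensions, and weighting scheme are defined in the released repository.

```python
# Hypothetical sketch: aggregate per-dimension metric scores into one
# composite score. Metric names, dimensions, and equal weights below are
# assumptions, not LoCoBench's actual definitions.

from statistics import mean


def composite_score(metric_scores: dict[str, dict[str, float]],
                    dimension_weights: dict[str, float] | None = None) -> float:
    """Average each dimension's metrics, then take a weighted mean over dimensions.

    metric_scores maps dimension name -> {metric name: score in [0, 1]}.
    """
    if dimension_weights is None:
        # Assume equal weighting when no weights are given.
        dimension_weights = {dim: 1.0 for dim in metric_scores}
    total_weight = sum(dimension_weights.values())
    weighted_sum = sum(
        dimension_weights[dim] * mean(scores.values())
        for dim, scores in metric_scores.items()
    )
    return weighted_sum / total_weight


# Example with made-up dimensions and scores:
example = {
    "code_correctness": {"compile_rate": 0.82, "test_pass_rate": 0.64},
    "architectural_consistency": {"dependency_coherence": 0.71},
    "long_context_retention": {"cross_file_recall": 0.58},
    "efficiency": {"latency_score": 0.90},
}
print(f"Composite score: {composite_score(example):.3f}")
```

A weighted mean of dimension averages is only one plausible design; consult the repository linked above for the authoritative metric definitions and aggregation.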
Similar Papers
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
Software Engineering
Tests AI's ability to write complex computer code.
LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
Computation and Language
Tests if computers can understand long computer code.
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
CV and Pattern Recognition
Tests computers that understand many pictures and words.