LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark
By: Ziyang Chen, Xing Wu, Junlong Jia, and more
Potential Business Impact:
Measures how reliably AI models can find and use information spread across very long documents.
The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative Construction pipeline: frontier LLMs draft challenging questions and reference answers, along with design rationales and solution processes, to reduce the cost of expert verification. Experts then rigorously validate correctness and refine problematic cases. Evaluating 46 widely used long-context LLMs on LongBench Pro yields three findings: (1) long-context optimization contributes more to long-context comprehension than parameter scaling; (2) effective context length is typically shorter than the claimed context length, with pronounced cross-lingual misalignment; and (3) the "thinking" paradigm helps primarily models trained with native reasoning, while mixed-thinking designs offer a promising Pareto trade-off. In summary, LongBench Pro provides a robust testbed for advancing long-context understanding.
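To make the taxonomy concrete, the sketch below shows one plausible way to represent a LongBench Pro-style sample and to estimate an "effective context length" from per-length scores. The field names, the level encodings, and the thresholding rule are illustrative assumptions, not the paper's actual schema or metrics.

```python
# Hypothetical sketch of a LongBench Pro-style sample record and a per-length
# scoring helper. Field names, level encodings, and the threshold rule are
# assumptions for illustration; they are not the benchmark's actual schema.
from dataclasses import dataclass
from collections import defaultdict
from enum import Enum
from typing import Iterable


class Dependency(Enum):
    FULL = "full"        # answering requires the whole context
    PARTIAL = "partial"  # answering needs only part of the context


@dataclass
class Sample:
    language: str          # "en" or "zh"
    primary_task: str      # one of the 11 primary tasks
    secondary_task: str    # one of the 25 secondary tasks
    dependency: Dependency
    length_level: int      # 1..6, coarse input-length bucket (8k to 256k tokens)
    difficulty: int        # 1..4, calibrated by model performance
    question: str
    reference_answer: str


def effective_length_level(samples: Iterable[Sample],
                           scores: dict[int, float],
                           threshold: float = 0.5) -> int:
    """Group per-sample scores (in [0, 1]) by length level and return the
    highest level whose mean score stays at or above `threshold` -- a rough
    proxy for the 'effective context length' discussed in the abstract."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for i, s in enumerate(samples):
        buckets[s.length_level].append(scores[i])

    effective = 0
    for level in sorted(buckets):
        mean = sum(buckets[level]) / len(buckets[level])
        if mean >= threshold:
            effective = level
        else:
            break
    return effective
```

Under these assumptions, comparing the returned level against the six length buckets (and repeating the computation per language) is one way to surface the claimed-versus-effective context gap and the cross-lingual misalignment the abstract reports.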
Similar Papers
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Software Engineering
Tests whether AI can understand large, complex codebases.
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
CV and Pattern Recognition
Tests AI models that reason over long mixes of images and text.
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Computation and Language
Asks whether existing long-context benchmarks truly measure long-context ability.