LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding
By: Sheikh Jubair, Arwa Omayrah, Amal Alshammari, and more
Potential Business Impact:
Tests how well computers understand very long documents in English and Arabic.
Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of model performance across the two languages and different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.
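To make the benchmark's structure concrete, below is a minimal sketch of how a bilingual, multi-task long-context benchmark like LC-Eval could be represented and scored. The field names, task labels, and the `query_model` stub are illustrative assumptions for this sketch, not the authors' released data format or evaluation code.

```python
# Hypothetical sketch: representing and scoring a bilingual, multi-task
# long-context benchmark. All names here are assumptions, not LC-Eval's API.

from dataclasses import dataclass
from typing import Callable, List

TASKS = (
    "multi_document_qa",   # answer drawn from several documents
    "bilingual_qa",        # context in one language, question/answer in the other
    "claim_verification",  # verify a claim against a long paragraph
    "multiple_choice",     # MCQ grounded in a long context
)

@dataclass
class Example:
    task: str        # one of TASKS
    language: str    # "en" or "ar"
    context: str     # roughly 4k to 128k+ tokens of source text
    question: str
    reference: str   # gold answer or label

def evaluate(examples: List[Example], query_model: Callable[[str], str]) -> float:
    """Exact-match accuracy of a model callable over a list of examples."""
    correct = 0
    for ex in examples:
        prompt = f"{ex.context}\n\nTask: {ex.task}\nQuestion: {ex.question}\nAnswer:"
        prediction = query_model(prompt)
        correct += int(prediction.strip().lower() == ex.reference.strip().lower())
    return correct / len(examples) if examples else 0.0

if __name__ == "__main__":
    # Toy run with a trivial stand-in "model"; a real run would call an LLM API.
    demo = [Example("multiple_choice", "en", "Some long context ...", "Pick A or B?", "A")]
    print(evaluate(demo, lambda prompt: "A"))
```

In practice, scoring would differ by task (e.g., multiple-choice and claim verification admit exact match, while open-ended QA typically needs token-overlap or judge-based metrics), but the per-task, per-language record layout above conveys the benchmark's shape.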
Similar Papers
AcademicEval: Live Long-Context LLM Benchmark
Computation and Language
Tests if computers can understand long, complex writing.
Towards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications
Artificial Intelligence
Tests online shopping AI on real customer questions.
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
Computation and Language
Helps computers judge good writing in many languages.