KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs
By: Donghyeon Ko, Yeguk Jin, Kyubyung Chae, and more
Potential Business Impact:
Tests if AI knows Korean facts correctly.
We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates the correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
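For context on the metrics quoted above: SimpleQA-style benchmarks grade each short answer as correct, incorrect, or not attempted (an abstention), and report accuracy both over all questions and over attempted ones. The sketch below shows how such aggregate scores could be computed, assuming KoSimpleQA adopts the same three-way grading as the English SimpleQA it is modeled on; the function and variable names are hypothetical, and the paper's actual evaluation code lives at the linked repository.

```python
from collections import Counter

# Three-way grading labels used in SimpleQA-style evaluation
# (assumed here to carry over to KoSimpleQA).
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"

def simpleqa_metrics(grades: list[str]) -> dict[str, float]:
    """Aggregate per-question grades into benchmark-level scores.

    `grades` is a hypothetical list with one label per question.
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    return {
        # Fraction of all questions answered correctly, the figure
        # the abstract reports (0.337 for the strongest model).
        "overall_correct": counts[CORRECT] / total,
        # Accuracy over attempted questions only: abstaining when
        # uncertain does not hurt this score, guessing wrong does.
        "correct_given_attempted": counts[CORRECT] / attempted if attempted else 0.0,
        # Abstention rate: how often the model declined to answer.
        "not_attempted_rate": counts[NOT_ATTEMPTED] / total,
    }

# Toy usage with made-up grades:
print(simpleqa_metrics([CORRECT, INCORRECT, NOT_ATTEMPTED, CORRECT]))
```

Separating overall accuracy from accuracy-given-attempted is what makes abstention measurable at all, which is why the paper can observe that reasoning both surfaces latent knowledge and improves abstention when uncertain.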
Similar Papers
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Computation and Language
Tests if AI tells the truth better.
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
CV and Pattern Recognition
Tests if AI understands video facts correctly.
Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context
Computation and Language
Teaches computers Korean culture for better understanding.