
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Published: December 11, 2025 | arXiv ID: 2512.10791v1

By: Aileen Cheng, Alon Jacovi, Amir Globerson and more

BigTech Affiliations: Google

Potential Business Impact:

Tests if AI tells the truth in different ways.

Business Areas:

Text Analytics, Data and Analytics, Software

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by having them answer closed-book factoid questions from their internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts.
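The aggregation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the leaderboard's actual code: it assumes the "average" is an unweighted arithmetic mean, and the function name, dictionary keys, and example scores are all hypothetical.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the abstract.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def facts_suite_score(scores: dict[str, float]) -> float:
    """Return the suite score as the mean of the four sub-leaderboard scores.

    Assumption: an unweighted arithmetic mean. The abstract only says
    "an average of the four components".
    """
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Made-up scores for illustration only; these are not reported results.
example = {
    "multimodal": 0.5,
    "parametric": 0.75,
    "search": 0.25,
    "grounding_v2": 0.5,
}
print(facts_suite_score(example))
```

Requiring all four scores before averaging reflects the abstract's description of a balanced measure: a model missing one component would otherwise get a score that is not comparable to the others.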

FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain

Computation and Language

Checks if AI gives correct medical advice.

2 Sep 2025 2

90%

RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking

Computation and Language

Tests computers' ability to spot fake news.

14 Jun 2025 1

89%

FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality

Computation and Language

Tests AI facts with human-verified challenges

31 Jul 2025 4


Country of Origin

🇺🇸 United States

Page Count

18 pages
