FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain

Published: September 2, 2025 | arXiv ID: 2509.02198v1

By: Anum Afzal, Juraj Vladika, Florian Matthes

Potential Business Impact:

Checks whether AI-generated medical text is factually correct.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models tend to struggle in specialized domains. While all aspects of evaluation matter, factuality is the most critical; reliable fact-checking tools and data sources are therefore essential for mitigating hallucination. We address these issues by providing FActBench, a comprehensive fact-checking benchmark covering four generation tasks and six state-of-the-art Large Language Models (LLMs) in the medical domain. We apply two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) Prompting and Natural Language Inference (NLI). Our experiments show that fact-checking scores obtained through Unanimous Voting of both techniques correlate best with Domain Expert Evaluation.
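The unanimous-voting idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, verdict labels, and the aggregation into a score are all assumptions; the paper only states that a claim's verdicts from CoT prompting and NLI are combined by unanimous vote.

```python
# Hypothetical sketch of unanimous voting over two fact-checking verdicts.
# Labels ("supported", "refuted", "undecided") are illustrative, not from the paper.

def unanimous_vote(cot_verdict: str, nli_verdict: str) -> str:
    """Return the shared label when both checkers agree, else 'undecided'."""
    if cot_verdict == nli_verdict:
        return cot_verdict
    return "undecided"

def fact_score(claim_verdicts) -> float:
    """Fraction of claims that both techniques unanimously label 'supported'."""
    votes = [unanimous_vote(cot, nli) for cot, nli in claim_verdicts]
    if not votes:
        return 0.0
    return sum(v == "supported" for v in votes) / len(votes)

# Example: three claims, each checked by CoT prompting and by NLI.
claims = [
    ("supported", "supported"),  # agreement -> counts as supported
    ("supported", "refuted"),    # disagreement -> undecided
    ("refuted", "refuted"),      # agreement, but not supported
]
print(fact_score(claims))
```

Only the first claim receives unanimous support, so the score here is 1/3. Requiring agreement between two independent checkers trades recall for precision, which matches the paper's finding that the unanimous score tracks domain-expert judgments most closely.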

Country of Origin
🇩🇪 Germany

Repos / Data Links

Page Count
12 pages

Category
Computer Science:
Computation and Language