FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
By: Anum Afzal, Juraj Vladika, Florian Matthes
Potential Business Impact:
Checks if AI gives correct medical advice.
Large Language Models (LLMs) tend to struggle in specialized domains. While all aspects of evaluation matter, factuality is the most critical; at the same time, reliable fact-checking tools and data sources are essential for mitigating hallucinations. We address these issues with FActBench, a comprehensive fact-checking benchmark for the medical domain covering four generation tasks and six state-of-the-art LLMs. We apply two state-of-the-art fact-checking techniques: Chain-of-Thought (CoT) prompting and Natural Language Inference (NLI). Our experiments show that fact-checking scores obtained through unanimous voting of both techniques correlate best with domain-expert evaluation.
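The unanimous-voting idea from the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label scheme (`SUPPORTED` / `REFUTED` / `NOT_ENOUGH_INFO`) and the function names are assumptions, and the per-claim verdicts would in practice come from a CoT-prompted LLM and an NLI model.

```python
def unanimous_vote(cot_verdict: str, nli_verdict: str) -> str:
    """Accept a verdict only when both fact-checking techniques agree;
    otherwise abstain by returning NOT_ENOUGH_INFO."""
    if cot_verdict == nli_verdict:
        return cot_verdict
    return "NOT_ENOUGH_INFO"

def fact_check(claim: str, cot_verdict: str, nli_verdict: str) -> dict:
    """Combine per-claim verdicts from CoT prompting and NLI into one
    final label for the claim."""
    return {
        "claim": claim,
        "cot": cot_verdict,
        "nli": nli_verdict,
        "final": unanimous_vote(cot_verdict, nli_verdict),
    }

# When the two techniques agree, their shared verdict stands;
# on disagreement, the claim is left unverified.
agreed = fact_check("Aspirin reduces fever.", "SUPPORTED", "SUPPORTED")
split = fact_check("Aspirin cures the flu.", "SUPPORTED", "REFUTED")
```

Requiring agreement trades coverage for precision, which matches the paper's finding that the unanimous-vote score correlates best with domain-expert judgment.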
Similar Papers
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Computation and Language
Helps doctors trust AI's medical advice.
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
Computation and Language
Tests computers' ability to spot fake news.
FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs
Computation and Language
Checks if AI stories are true, fast.