Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
By: Samy Ateia, Udo Kruschwitz
Potential Business Impact:
Helps AI research assistants check and improve their own answers to expert biomedical questions.
Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes in which Large Language Models (LLMs) iteratively refine their outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges: automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, which uses expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs, including Gemini-Flash 2.0, o3-mini, o4-mini, and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism in which LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and whether reasoning models are more capable of generating useful feedback. Preliminary results indicate that the self-feedback strategy performs unevenly across models and tasks. This work offers insights into LLM self-correction and informs future work comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
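The self-feedback mechanism described in the abstract is a generate, critique, refine loop. The sketch below illustrates that pattern in Python under stated assumptions: call_llm is a hypothetical placeholder for whichever chat-completion client is used (the paper evaluates Gemini-Flash 2.0, o3-mini, o4-mini, and DeepSeek-R1), and the prompts are illustrative rather than the authors' actual prompts or implementation.

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around an LLM chat-completion API; plug in a real client here.
    raise NotImplementedError("connect this to your model provider")

def self_feedback_answer(question: str, snippets: list[str], rounds: int = 1) -> str:
    # Generate an initial answer, then iteratively critique and refine it.
    context = "\n".join(snippets)

    # Step 1: draft an answer grounded in the retrieved snippets.
    answer = call_llm(
        "Answer the biomedical question using only the snippets below.\n"
        f"Question: {question}\nSnippets:\n{context}\nAnswer:"
    )

    for _ in range(rounds):
        # Step 2: have the model evaluate its own draft.
        feedback = call_llm(
            "Critique the draft answer for factual errors, omissions, and unsupported claims.\n"
            f"Question: {question}\nSnippets:\n{context}\nDraft answer: {answer}\nCritique:"
        )
        # Step 3: refine the draft using that critique.
        answer = call_llm(
            "Revise the draft answer to address the critique, staying grounded in the snippets.\n"
            f"Question: {question}\nSnippets:\n{context}\n"
            f"Draft answer: {answer}\nCritique: {feedback}\nRevised answer:"
        )

    return answer

The same loop applies to query expansion and to the other BioASQ answer types (yes/no, factoid, list, ideal) by swapping the prompts; whether the extra critique round actually helps is what the paper reports as varying across models and tasks.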
Similar Papers
Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework
Computation and Language
Answers medical questions accurately using reliable sources.
Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment
Computation and Language
Helps computers understand complex ideas better.
Agentic large language models improve retrieval-based radiology question answering
Computation and Language
Boosts AI accuracy in radiology diagnoses.