Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference
By: Matteo Cercola, Valeria Capretti, Simone Formentin
Potential Business Impact:
Teaches computers faster by asking people what they like.
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: reinforcement learning from human feedback (RLHF) scales effectively to high-dimensional tasks such as large language model (LLM) fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
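The abstract gives no implementation details, so the sketch below is only a rough illustration of what an acquisition-driven preference-querying step inside an RLHF reward-learning loop could look like. It uses a small ensemble of linear Bradley-Terry reward models as a stand-in for the paper's Bayesian preference model and the predictive entropy of the preference as the acquisition score; every name and design choice here (EnsembleRewardModel, the entropy acquisition, the simulated labeler) is an assumption for illustration, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)

def bradley_terry_prob(r_a, r_b):
    """Probability that response A is preferred over B under a Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

class EnsembleRewardModel:
    """Tiny linear reward-model ensemble; disagreement across members serves
    as a proxy for posterior uncertainty over human preferences."""

    def __init__(self, n_members, dim, lr=0.1):
        self.weights = rng.normal(scale=0.1, size=(n_members, dim))
        self.lr = lr

    def acquisition(self, feat_a, feat_b):
        """Predictive entropy of the ensemble-averaged preference:
        higher means the members disagree, so a human label is most informative."""
        p = bradley_terry_prob(self.weights @ feat_a, self.weights @ feat_b)
        p_mean, eps = p.mean(), 1e-9
        return -(p_mean * np.log(p_mean + eps) + (1 - p_mean) * np.log(1 - p_mean + eps))

    def update(self, feat_winner, feat_loser):
        """One gradient-ascent step on the Bradley-Terry log-likelihood per member."""
        for k in range(len(self.weights)):
            p_lose = 1.0 - bradley_terry_prob(self.weights[k] @ feat_winner,
                                              self.weights[k] @ feat_loser)
            self.weights[k] += self.lr * p_lose * (feat_winner - feat_loser)

# Active preference-gathering loop (hypothetical stand-in for the acquisition-driven module).
dim, n_pairs, n_queries = 8, 50, 10
model = EnsembleRewardModel(n_members=5, dim=dim)
true_w = rng.normal(size=dim)                  # simulated "human" utility for labeling
pairs = rng.normal(size=(n_pairs, 2, dim))     # feature vectors of candidate response pairs

for _ in range(n_queries):
    # Score every unlabeled pair with the acquisition function and query only the best one.
    scores = [model.acquisition(a, b) for a, b in pairs]
    i = int(np.argmax(scores))
    a, b = pairs[i]
    winner, loser = (a, b) if true_w @ a >= true_w @ b else (b, a)  # simulated human label
    model.update(winner, loser)
    pairs = np.delete(pairs, i, axis=0)        # do not re-query the same pair
```

In a full pipeline the pair features would come from the policy's own candidate responses and the labels from human annotators rather than the simulated utility above, with the learned reward model then driving the usual RL fine-tuning step.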
Similar Papers
Policy-labeled Preference Learning: Is Preference Enough for RLHF?
Machine Learning (CS)
Teaches computers to learn better from people.
Maximizing the efficiency of human feedback in AI alignment: a comparative analysis
Human-Computer Interaction
Teaches AI to learn faster from people's choices.