
RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders

Published: August 7, 2025 | arXiv ID: 2508.05289v1

By: Zhongheng Yang, Aijia Sun, Yushang Zhao, and more

Potential Business Impact:

Makes online suggestions better by learning from your actions.

Conversational recommender systems (CRS) based on Large Language Models (LLMs) must be continually aligned with user preferences to provide satisfying, context-relevant item recommendations. Traditional supervised fine-tuning cannot capture implicit feedback signals such as dwell time, sentiment polarity, or engagement patterns. In this paper, we present a fine-tuning solution that uses reinforcement learning from human feedback (RLHF) to maximize implicit user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_{\phi}$ learned from weakly labelled engagement information and optimize the foundation LLM $M_{\theta}$ with proximal policy optimization (PPO) to maximize user-centric utility. The architecture models conversational state transitions $s_t \to a_t \to s_{t+1}$, where the action $a_t$ is an LLM-generated item suggestion conditioned on the conversation history. Evaluation on synthetic and real-world datasets (e.g., REDIAL, OpenDialKG) demonstrates that our RLHF-fine-tuned models outperform baseline approaches in top-$k$ recommendation accuracy, coherence, and user satisfaction. This paper shows that implicit-signal alignment offers an efficient path toward scalable, user-adaptive CRS design.
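The loop the abstract describes can be illustrated with a minimal sketch: a reward model $R_{\phi}$ is fit on weakly labelled engagement signals, and a policy standing in for $M_{\theta}$ is updated with a PPO-style clipped objective. This is not the authors' implementation; the module shapes, the candidate-item action space, the random placeholder data, and the use of a simple mean baseline instead of a learned critic are all assumptions made for illustration.

```python
# Sketch of the two-stage pipeline from the abstract (illustrative only):
# 1) train R_phi on weak engagement labels, 2) PPO-style update of the policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ITEMS = 128, 500  # hypothetical sizes

class RewardModel(nn.Module):
    """R_phi: scores a (conversation-state, recommended-item) pair."""
    def __init__(self):
        super().__init__()
        self.item_emb = nn.Embedding(NUM_ITEMS, STATE_DIM)
        self.head = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, item):
        return self.head(torch.cat([state, self.item_emb(item)], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Stand-in for M_theta: logits over candidate items given the state s_t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(STATE_DIM, NUM_ITEMS)

    def forward(self, state):
        return self.net(state)

reward_model, policy = RewardModel(), Policy()
r_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
p_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Placeholder data: conversation-state encodings, shown items, and weak
# engagement labels (e.g., normalized dwell time) in [0, 1].
states = torch.randn(64, STATE_DIM)
items = torch.randint(0, NUM_ITEMS, (64,))
engagement = torch.rand(64)

# Stage 1: fit R_phi by regression on the weak engagement labels.
for _ in range(50):
    r_opt.zero_grad()
    F.mse_loss(reward_model(states, items), engagement).backward()
    r_opt.step()

# Stage 2: PPO-style updates. Collect a batch under the current (old) policy,
# score actions with R_phi, then take clipped-surrogate gradient steps.
for iteration in range(5):
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=policy(states))
        actions = old_dist.sample()                 # a_t ~ old policy(. | s_t)
        old_log_probs = old_dist.log_prob(actions)
        rewards = reward_model(states, actions)
        advantages = rewards - rewards.mean()       # mean baseline, no learned critic
    for epoch in range(4):
        dist = torch.distributions.Categorical(logits=policy(states))
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
        clipped = torch.clamp(ratio, 0.8, 1.2)      # PPO clip range epsilon = 0.2
        ppo_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        p_opt.zero_grad()
        ppo_loss.backward()
        p_opt.step()
```

In a full system the random placeholder tensors would be replaced by LLM-encoded dialogue states and logged recommendation turns, and the clipped objective would typically be combined with a KL penalty against the pre-fine-tuned model, as is standard in RLHF pipelines.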

Country of Origin
🇺🇸 United States

Page Count
5 pages

Category
Computer Science:
Machine Learning (CS)