Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
By: Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, and more
Potential Business Impact:
Teaches AI to explain things clearly and correctly.
While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals such as pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of lightweight encoder models for nuanced reward shaping in complex generation tasks.
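The core mechanism described in the abstract can be sketched in a few lines: embed the generated explanation and the reference with an encoder, take their cosine similarity as the reward, then standardise rewards within each sampled group to obtain the relative advantages GRPO optimises. This is a minimal illustration, not the paper's implementation; the function names are hypothetical, and in practice the embeddings would come from a small encoder-only transformer (e.g. a sentence-embedding model) rather than the toy vectors shown here.

```python
import numpy as np

def semantic_reward(gen_emb, ref_emb):
    """Cosine similarity between the generated explanation's embedding
    and the ground-truth reference's embedding (assumed dense reward)."""
    g = np.asarray(gen_emb, dtype=float)
    r = np.asarray(ref_emb, dtype=float)
    return float(g @ r / (np.linalg.norm(g) * np.linalg.norm(r)))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise rewards across the group
    of completions sampled for one prompt, as in GRPO."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: three sampled explanations scored against one reference.
ref = np.array([1.0, 0.0, 1.0])
gens = [np.array([1.0, 0.0, 1.0]),   # matches the reference
        np.array([1.0, 1.0, 0.0]),   # partial overlap
        np.array([0.0, 1.0, 0.0])]   # semantically off
rewards = [semantic_reward(g, ref) for g in gens]
advantages = grpo_advantages(rewards)
```

Because the advantages are computed relative to the group mean, the policy gradient pushes probability mass toward the completions whose embeddings sit closest to the expert reference, without needing an absolute reward scale.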
Similar Papers
Parent-Guided Semantic Reward Model (PGSRM): Embedding-Based Reward Functions for Reinforcement Learning of Transformer Language Models
Machine Learning (CS)
Teaches computers to write better using a smart trick.
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Sound
Makes computer voices sound more natural and human.
Training-Free Group Relative Policy Optimization
Computation and Language
Teaches computers to solve new problems better.