Reward Model Routing in Alignment
By: Xinle Wu, Yao Lu
Potential Business Impact:
Helps AI learn better by using many "teachers."
Reinforcement learning from human or AI feedback (RLHF / RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), which limits alignment quality and risks overfitting. Recent work explores RM routing, which dynamically selects an RM from a candidate pool to exploit complementary strengths while keeping RM calls at $O(1)$ per query, but existing methods suffer from cold-start issues and insufficient exploration. We propose BayesianRouter, a hybrid routing framework that combines offline learning of RM strengths with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with the offline embeddings as Gaussian priors and updating their posteriors with online rewards to track the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.
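The online stage can be read as a Bayesian linear contextual bandit over the RM pool. Below is a minimal sketch under that reading, assuming each query is summarized by a fixed-dimensional embedding and each RM's offline-learned embedding serves as the Gaussian prior mean of its weight vector; the class, function, and parameter names (BayesianRMRouter, prior_var, noise_var, etc.) are illustrative assumptions, not details from the paper.

```python
import numpy as np


class BayesianRMRouter:
    """Sketch of per-query RM selection via Thompson sampling.

    Each candidate reward model (RM) keeps a Gaussian posterior over a
    weight vector; the prior mean is that RM's offline-learned embedding.
    This is a simplified illustration, not the authors' implementation.
    """

    def __init__(self, prior_means, prior_var=1.0, noise_var=0.25):
        # prior_means: one offline embedding per RM, each of shape (d,)
        self.d = prior_means[0].shape[0]
        self.mu = [m.astype(float) for m in prior_means]                # posterior means
        self.Sigma = [prior_var * np.eye(self.d) for _ in prior_means]  # posterior covariances
        self.noise_var = noise_var                                      # assumed reward-noise variance

    def select(self, query_emb):
        """Thompson sampling: draw one weight vector per RM, pick the highest score."""
        scores = []
        for mu, Sigma in zip(self.mu, self.Sigma):
            w = np.random.multivariate_normal(mu, Sigma)
            scores.append(w @ query_emb)
        return int(np.argmax(scores))

    def update(self, rm_idx, query_emb, reward):
        """Conjugate Bayesian linear-regression update for the chosen RM only."""
        Sigma_inv = np.linalg.inv(self.Sigma[rm_idx])
        x = query_emb
        new_Sigma = np.linalg.inv(Sigma_inv + np.outer(x, x) / self.noise_var)
        new_mu = new_Sigma @ (Sigma_inv @ self.mu[rm_idx] + x * reward / self.noise_var)
        self.mu[rm_idx], self.Sigma[rm_idx] = new_mu, new_Sigma


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for offline RM embeddings.
    rng = np.random.default_rng(0)
    d, num_rms = 16, 3
    router = BayesianRMRouter([rng.normal(size=d) for _ in range(num_rms)])
    q = rng.normal(size=d)                # query embedding (assumed to come from an encoder)
    chosen = router.select(q)             # only the chosen RM is called, keeping O(1) RM calls
    router.update(chosen, q, reward=0.7)  # feed the observed online reward back into the posterior
```

The Gaussian prior centered on the offline embedding is what lets the router avoid a cold start, while the posterior covariance shrinks as online rewards accumulate, trading exploration for exploitation as the policy distribution drifts.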
Similar Papers
Ask a Strong LLM Judge when Your Reward Model is Uncertain
Machine Learning (CS)
Lets AI learn better by using smart guessing.
HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning
Computation and Language
Makes smart computer programs run faster and cheaper.
Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
Computation and Language
Lets AI pick the best AI for each question.