OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
By: Youkang Wang, Jian Wang, Rubing Chen and more
Potential Business Impact:
Makes AI smarter by learning from its own mistakes.
Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates the inference budget. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it reuses the retained rollouts for on-policy updates, integrating seamlessly with algorithms such as PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be released upon acceptance at https://open-upon-acceptance.
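To make the adaptive-stopping idea concrete, here is a minimal sketch (not the authors' released code) of confidence-gated majority voting: rollouts are sampled one at a time, a Dirichlet posterior over answer categories is updated, and sampling halts once the posterior probability that the current consensus answer is the true mode exceeds a threshold. The names `generate_rollout` and `adaptive_vote`, the Monte Carlo posterior estimate, and the agreement-based pseudo-reward are illustrative assumptions standing in for the paper's exact procedure.

```python
"""Sketch of adaptive rollout allocation via Bayesian stopping on majority voting."""
import numpy as np
from collections import Counter


def posterior_confidence(counts, alpha=1.0, n_mc=4000, rng=None):
    """Return the modal answer and the Monte Carlo estimate of the posterior
    probability that it is the most likely answer, under a symmetric
    Dirichlet(alpha) prior over the observed answer categories."""
    rng = rng or np.random.default_rng(0)
    answers = list(counts.keys())
    concentration = np.array([counts[a] + alpha for a in answers], dtype=float)
    samples = rng.dirichlet(concentration, size=n_mc)   # (n_mc, K) probability vectors
    winner_idx = int(np.argmax([counts[a] for a in answers]))
    p_top = float(np.mean(np.argmax(samples, axis=1) == winner_idx))
    return answers[winner_idx], p_top


def adaptive_vote(generate_rollout, threshold=0.95, min_rollouts=2, max_rollouts=16):
    """Sample rollouts sequentially and stop once the posterior confidence in
    the consensus answer exceeds `threshold`, or the budget is exhausted.
    Returns the consensus answer, the retained rollouts, and pseudo-rewards
    (agreement with the consensus) usable for a PPO/GRPO-style update."""
    rollouts, counts = [], Counter()
    for step in range(max_rollouts):
        text, answer = generate_rollout()               # one sampled reasoning trace + parsed answer
        rollouts.append((text, answer))
        counts[answer] += 1
        if step + 1 >= min_rollouts:
            _, confidence = posterior_confidence(counts)
            if confidence >= threshold:
                break
    consensus, _ = posterior_confidence(counts)
    rewards = [1.0 if ans == consensus else 0.0 for _, ans in rollouts]
    return consensus, rollouts, rewards
```

Under this reading, easy queries where the first few rollouts agree terminate early, while ambiguous queries spend more of the budget; either way the retained rollouts and their consensus-agreement rewards feed the on-policy update, so no ground-truth labels are needed.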
Similar Papers
R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning
Machine Learning (CS)
Makes AI smarter by training it better.
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Machine Learning (CS)
Makes AI smarter and faster at thinking.
d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models
Computation and Language
Helps AI solve math and logic puzzles better.