One-Step Flow Policy Mirror Descent
By: Tianyi Chen, Haitong Ma, Na Li, and more
Potential Business Impact:
Lets robots choose actions much faster while they learn.
Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, inference with diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables one-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight-interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on flow policy and MeanFlow policy parametrizations, respectively. Extensive empirical evaluations on MuJoCo benchmarks demonstrate that our algorithms achieve performance comparable to diffusion policy baselines while requiring hundreds of times fewer function evaluations during inference.
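To make the one-step sampling idea concrete, here is a minimal, illustrative PyTorch sketch of drawing an action from a state-conditioned flow policy in a single function evaluation. With straight (linear) interpolation x_t = (1 - t) x_0 + t a, the target velocity is a - x_0, so one Euler step from t = 0 recovers the action up to the discretization error the paper ties to the policy's variance. The network VelocityNet, its sizes, the conditioning scheme, and the helper one_step_flow_sample are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of 1-step action sampling with a straight-interpolation flow
# policy. Names, sizes, and conditioning are illustrative assumptions.
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Predicts the flow velocity v_theta(x_t, t | s) for a state-conditioned policy."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, x_t, t):
        # Concatenate state, noisy action, and time into one input vector.
        return self.net(torch.cat([state, x_t, t], dim=-1))


@torch.no_grad()
def one_step_flow_sample(vel_net, state):
    """Single Euler step from t=0 to t=1: a_hat = x_0 + v_theta(x_0, 0 | s)."""
    action_dim = vel_net.net[-1].out_features
    x_0 = torch.randn(state.shape[0], action_dim)   # Gaussian noise sample
    t = torch.zeros(state.shape[0], 1)               # start of the flow
    return x_0 + vel_net(state, x_0, t)              # one network evaluation


if __name__ == "__main__":
    # Usage: sample actions for a batch of 4 states with one forward pass.
    vel_net = VelocityNet(state_dim=17, action_dim=6)
    states = torch.randn(4, 17)
    actions = one_step_flow_sample(vel_net, states)
    print(actions.shape)  # torch.Size([4, 6])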
Similar Papers
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Machine Learning (CS)
Makes AI create better pictures faster.
Flow Matching Policy Gradients
Machine Learning (CS)
Teaches robots to move better in tricky situations.
OMP: One-step Meanflow Policy with Directional Alignment
Robotics
Robots learn new tasks faster and more accurately.