Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
By: Nonghai Zhang, Weitao Ma, Zhanyu Ma, et al.
Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success relies heavily on expensive external verifiers or hand-crafted rules. Such dependency not only incurs substantial computational cost and training latency but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent-space geometry. Crucially, our empirical analysis reveals a compelling geometric property: the terminal-token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Further experiments demonstrate strong generalization and robustness. The code will be released soon.
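The abstract describes IRCE only at a high level: project terminal-token representations onto the unit sphere, iteratively aggregate a robust centroid, and reward each trajectory by its proximity to that centroid. Below is a minimal NumPy sketch of that idea, assuming the terminal hidden states of a sampled group are available as a matrix; the function name irce_rewards, the top-k trimming rule, and all hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def irce_rewards(h, n_iters=5, trim_frac=0.3):
    """Sketch of an IRCE-style intrinsic reward.

    h: (n, d) array of terminal-token hidden states for a group of
       sampled trajectories (hypothetical input format; the paper's
       exact feature extraction is not specified in the abstract).
    """
    # Spherical projection: unit-normalize each representation so that
    # magnitude fluctuations cannot dominate the similarity scores.
    z = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)

    # Initialize the centroid as the mean direction over all trajectories.
    c = z.mean(axis=0)
    c /= np.linalg.norm(c) + 1e-8

    # Iterative robust estimation: keep only the trajectories closest to
    # the current centroid (outlier trimming) and re-aggregate. Scattered
    # incorrect trajectories are progressively excluded, so the centroid
    # converges toward the dense cluster of correct ones.
    for _ in range(n_iters):
        sims = z @ c                        # cosine similarity to centroid
        k = max(1, int(round((1 - trim_frac) * len(z))))
        inliers = np.argsort(sims)[-k:]     # indices of the densest cluster
        c = z[inliers].mean(axis=0)
        c /= np.linalg.norm(c) + 1e-8

    # Dense, continuous reward: similarity of each trajectory's terminal
    # representation to the robust "truth centroid".
    return z @ c

# Usage: score a group of 8 trajectories with 16-dim terminal states.
rng = np.random.default_rng(0)
states = rng.normal(size=(8, 16))
print(irce_rewards(states))
```

In a GRPO-style loop, such scores could stand in for the external verifier's reward before group-relative normalization, which is presumably how the method removes the verifier from the training path.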
Similar Papers
GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
Computation and Language
Improves geometry problem solving in LLMs via group contrastive policy optimization.
ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning
Artificial Intelligence
Uses intrinsic confidence signals for more efficient reinforcement learning.