DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
By: Shih-Yang Liu, Xin Dong, Ximing Lu, and more
Potential Business Impact:
Makes AI answers shorter and smarter.
Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, i.e., accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing the accuracy of all prior baselines. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens the truncation budget on easier questions for additional efficiency gains. Finally, we propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful when RL training data is scarce.
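To make the recipe concrete, here is a minimal, illustrative Python sketch of its ingredients: a truncation length penalty, batch-wise (rather than per-group) reward normalization, a "clip-higher" surrogate, dynamic sampling of informative prompts, and a difficulty-aware budget. The function names, the binary correctness reward, and hyperparameter values such as eps_high=0.28 and the linear budget schedule are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def truncation_reward(is_correct: bool, length: int, budget: int) -> float:
    """Simplest length penalty: responses exceeding the token budget are
    truncated and receive zero reward; otherwise reward = correctness."""
    if length > budget:
        return 0.0
    return 1.0 if is_correct else 0.0

def batchwise_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Batch-wise reward normalization: normalize against the whole batch's
    mean/std instead of each prompt's small rollout group, reducing the
    bias a per-group baseline introduces into advantage estimates."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clip_higher_surrogate(ratio: np.ndarray, adv: np.ndarray,
                          eps_low: float = 0.2,
                          eps_high: float = 0.28) -> np.ndarray:
    """PPO-style clipped objective with a larger upper clip range
    ('clip-higher'), giving low-probability tokens room to grow and
    counteracting entropy collapse. eps_high is a placeholder value."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)

def dynamic_sampling_mask(group_rewards: list) -> list:
    """Dynamic sampling: keep only prompts whose rollout rewards are not
    all identical, since all-correct or all-wrong groups carry no
    gradient signal."""
    return [i for i, r in enumerate(group_rewards) if np.std(r) > 0]

def difficulty_aware_budget(pass_rate: float, base_budget: int) -> int:
    """Difficulty-Aware DLER (sketch): tighten the truncation budget on
    easier questions (higher observed pass rate). The linear schedule
    here is a hypothetical choice, not the paper's."""
    return int(base_budget * (1.0 - 0.5 * pass_rate))
```

In such a setup, a training step would score each rollout with truncation_reward under a (possibly difficulty-aware) budget, normalize rewards across the batch with batchwise_advantages, drop uninformative prompts via dynamic_sampling_mask, and maximize clip_higher_surrogate.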
Similar Papers
Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards
Computation and Language
Makes AI think faster without making mistakes.
Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning
Artificial Intelligence
Saves compute by thinking less on easy problems.
Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL
Artificial Intelligence
Teaches computers to think step-by-step better.