This August 2025 paper from Microsoft Research introduces rStar2-Agent, a 14B-parameter math reasoning model that uses agentic reinforcement learning to achieve state-of-the-art performance, comparable to that of much larger models. The model is trained to "think smarter" through three key innovations: an efficient RL infrastructure that supports high-throughput code execution, a novel GRPO-RoC algorithm that copes with a noisy code environment by filtering for high-quality trajectories, and an efficient training recipe that minimizes computational cost. rStar2-Agent-14B reaches high accuracy on challenging math benchmarks such as AIME24/25 and generalizes well to other domains, including scientific reasoning and tool-use tasks, while producing shorter responses than its counterparts. The paper also examines how high-entropy tokens in the model's reasoning traces mark advanced cognitive behaviors such as self-reflection and adaptive exploration in response to tool feedback and errors.
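To make the trajectory-filtering idea concrete, here is a minimal Python sketch, not the paper's implementation. It assumes a GRPO-style advantage (each rollout's reward standardized within its sampling group) and a hypothetical filtering rule: oversample rollouts, keep the correct/incorrect ratio, and among correct rollouts prefer those with the fewest failed tool calls. The `Rollout` fields, `filter_rollouts`, `group_advantages`, and the `tool_errors` signal are illustrative names, not taken from the source.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    reward: float      # 1.0 if the final answer is correct, else 0.0
    tool_errors: int   # hypothetical noise signal: failed code executions


def filter_rollouts(rollouts, keep):
    """Downsample an oversampled group to `keep` rollouts.

    Assumed rule (for illustration only): preserve the correct/incorrect
    ratio, and among correct rollouts keep those with the fewest tool errors.
    """
    correct = sorted((r for r in rollouts if r.reward > 0),
                     key=lambda r: r.tool_errors)
    incorrect = [r for r in rollouts if r.reward <= 0]
    n_correct = min(round(keep * len(correct) / len(rollouts)), len(correct))
    n_incorrect = min(keep - n_correct, len(incorrect))
    # Incorrect rollouts are taken in order here; a real system would
    # more likely sample them at random.
    return correct[:n_correct] + incorrect[:n_incorrect]


def group_advantages(rollouts, eps=1e-6):
    """GRPO-style advantage: reward standardized within the group."""
    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Eight oversampled rollouts for one prompt, reduced to a group of four.
group = [
    Rollout(1.0, 0), Rollout(1.0, 3), Rollout(0.0, 1), Rollout(0.0, 2),
    Rollout(1.0, 1), Rollout(0.0, 0), Rollout(1.0, 5), Rollout(0.0, 4),
]
kept = filter_rollouts(group, keep=4)
advantages = group_advantages(kept)
```

In this toy group, the two correct rollouts that survive are the ones with 0 and 1 tool errors, so the positive learning signal comes from the cleaner trajectories while incorrect rollouts still contribute negative advantages.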
Source:
https://arxiv.org/pdf/2508.20722