Description

We review a heavyweight academic brawl from 2025 over the future of AI reasoning: what Reinforcement Learning with Verifiable Rewards (RLVR) actually does to Chain of Thought (CoT) reasoning. We have two papers on the table, and they are in direct conflict.

In one corner, we have a study by Yue et al. from Tsinghua University, which dropped a bombshell claim: they argue that Reinforcement Learning with Verifiable Rewards (RLVR)—the technique behind major models like DeepSeek-R1—does not actually make models smarter. According to their research, RLVR acts more like a filter than a teacher; it improves the model's efficiency at finding correct answers it already knew how to find, but it **fails to expand the model's reasoning capabilities**. In fact, they claim that as training progresses, the model's reasoning boundary actually narrows.

But in the other corner, we have a rebuttal from Wen et al. at Microsoft Research, who came out swinging. They explicitly cite Yue’s paper, labeling the Tsinghua team’s hypothesis as "adventurous" and challenging their methodology. Wen et al. argue that Yue’s team missed the forest for the trees by relying on the wrong metric. They claim that because models can sometimes guess the right answer with the wrong math, the standard evaluation (Pass@K) is unreliable. By introducing a new metric that checks the steps of reasoning (CoT-Pass@K), Wen et al. insist that RLVR does fundamentally extend the boundary of intelligence and that the skeptics were looking at the data all wrong.
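The disagreement over metrics is concrete enough to sketch in code. Pass@K estimates the chance that at least one of K sampled solutions has the right final answer; Wen et al.'s CoT-Pass@K additionally requires the reasoning chain itself to be sound, so lucky guesses don't count. Here is a minimal, hypothetical illustration (the sample data and the idea of a per-sample reasoning verdict are assumptions for demonstration, not the papers' actual pipeline):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: probability that at least one
    of k samples drawn without replacement from n is correct, given c correct."""
    if n - c < k:
        return 1.0  # every draw of k must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical sampled solutions for one problem: each records whether the
# final answer matched, and whether a verifier judged the reasoning sound.
samples = [
    {"answer_ok": True,  "cot_ok": True},   # right answer, sound reasoning
    {"answer_ok": True,  "cot_ok": False},  # lucky guess: right answer, flawed math
    {"answer_ok": False, "cot_ok": False},  # wrong answer
    {"answer_ok": True,  "cot_ok": False},  # another lucky guess
]

n = len(samples)
c_answer = sum(s["answer_ok"] for s in samples)               # Pass@K credits these
c_cot = sum(s["answer_ok"] and s["cot_ok"] for s in samples)  # CoT-Pass@K credits these

print(pass_at_k(n, c_answer, k=2))  # 1.0  -- guesses inflate the score
print(pass_at_k(n, c_cot, k=2))     # 0.5  -- only sound reasoning counts
```

With three of four answers "correct" but only one backed by sound reasoning, the two metrics diverge sharply, which is exactly the gap Wen et al. say Yue et al. fell into.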

It is a classic scientific standoff: one side says the technology is an efficiency hack; the other says it's a fundamental leap forward.

Sources:

Paper 1

November 25, 2025

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

https://arxiv.org/pdf/2504.13837

Paper 2

June 2025

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, Mao Yang

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

https://arxiv.org/pdf/2506.14245