Description

This paper introduces Self-Distillation Policy Optimization (SDPO), a novel reinforcement learning framework designed to improve how large language models learn from complex environments. While traditional methods often rely on simple scalar rewards that create information bottlenecks, SDPO uses rich textual feedback, such as runtime errors or descriptive evaluations, to provide denser learning signals. By treating the current model as a self-teacher that re-evaluates its own attempts in light of this feedback, the algorithm distills corrected predictions back into the policy without needing external human or AI mentors. Experiments show that this approach significantly enhances sample efficiency and reasoning accuracy across tasks like scientific problem-solving and competitive programming. Furthermore, SDPO produces noticeably more concise reasoning, avoiding the repetitive verbosity common in other reinforcement learning techniques. At test time, the method also accelerates the discovery of solutions for exceptionally difficult problems by iteratively refining the model's internal logic.
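The loop described above can be sketched in miniature. This is a hypothetical toy illustration, not the paper's implementation: the real policy and self-teacher are the same large language model, while here a dictionary stands in for the policy and simple string handling stands in for reasoning over textual feedback.

```python
# Toy sketch of the SDPO cycle: sample -> textual feedback -> self-teacher
# revision -> distill back into the policy. All names are illustrative.

def environment_feedback(prompt, attempt):
    """Return rich textual feedback instead of a scalar reward.
    Toy task: the correct answer is double the prompt."""
    expected = prompt * 2
    if attempt == expected:
        return "correct"
    return f"runtime error: expected {expected}, got {attempt}"

def self_teacher(prompt, attempt, feedback):
    """The current model re-evaluates its own attempt in light of the
    feedback and proposes a corrected prediction (no external mentor)."""
    if feedback == "correct":
        return attempt
    # Toy 'reasoning': recover the expected value from the feedback text.
    return int(feedback.split("expected ")[1].split(",")[0])

def sdpo_step(policy, prompt):
    attempt = policy.get(prompt, 0)                      # 1. sample from current policy
    feedback = environment_feedback(prompt, attempt)     # 2. dense textual feedback
    corrected = self_teacher(prompt, attempt, feedback)  # 3. self-teacher revision
    policy[prompt] = corrected                           # 4. distill into the policy
    return policy

policy = {3: 5}        # the policy initially answers 3 -> 5 (wrong)
sdpo_step(policy, 3)
print(policy[3])       # after one step the toy policy predicts 6
```

Running several such steps over many prompts is what makes the feedback signal denser than a scalar reward: the corrected prediction, not just a pass/fail bit, flows back into training.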