The paper addresses the limitation that Large Language Model (LLM) agents trained with standard reinforcement learning (RL) often struggle to actively explore their environments and adapt from trial-and-error experiences in multi-turn, long-horizon tasks.
To address this, the authors introduce LAMER (LLM Agent with Meta-RL), a general meta-RL framework that trains agents to actively explore and learn from environmental feedback at test time, balancing exploration and exploitation through two key components.
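To make the test-time adaptation idea concrete, below is a minimal Python sketch of an in-context meta-RL rollout loop, where each attempt's trajectory and return are fed back into the prompt so later attempts can exploit what earlier ones learned. The `env`, `llm_policy`, and prompt format here are hypothetical stand-ins under assumed interfaces, not LAMER's actual API.

```python
# Minimal sketch of a meta-RL-style test-time loop. `env` and
# `llm_policy` are hypothetical stand-ins, not LAMER's real interface.
from typing import Callable, List, Tuple

def meta_episode(
    env,                                  # environment with reset()/step()
    llm_policy: Callable[[str], str],     # maps a prompt to an action string
    num_attempts: int = 4,                # test-time attempts on one task
    max_steps: int = 50,
) -> List[Tuple[List[str], float]]:
    """Run several attempts on the same task, feeding earlier
    trajectories back into the prompt so the policy can adapt
    in context, with no weight updates at test time."""
    history = ""                          # accumulated cross-attempt context
    results = []
    for attempt in range(num_attempts):
        obs = env.reset()
        trajectory, total_return = [], 0.0
        for _ in range(max_steps):
            prompt = f"{history}Attempt {attempt + 1}\nObservation: {obs}\nAction:"
            action = llm_policy(prompt)
            obs, reward, done = env.step(action)
            trajectory.append(action)
            total_return += reward
            if done:
                break
        # Expose this attempt's outcome to later attempts, so failed
        # explorations become a learning signal rather than wasted turns.
        history += (
            f"Attempt {attempt + 1}: actions={trajectory}, "
            f"return={total_return:.2f}\n"
        )
        results.append((trajectory, total_return))
    return results
```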
Extensive evaluations across complex environments—including Sokoban, MineSweeper, WebShop, and ALFWorld—demonstrate that LAMER significantly outperforms both prompting-based methods and standard RL baselines. By internalizing exploration strategies, LAMER produces more diverse trajectories, exhibits much stronger test-time scaling across multiple attempts, and generalizes markedly better to harder and out-of-distribution tasks.