This paper introduces MUSE (Memory-Utilizing and Self-Evolving), a novel AI agent framework designed to tackle complex, long-horizon productivity tasks. Existing Large Language Model (LLM) agents are typically "test-time static": their capabilities are fixed after training, so they cannot continuously learn from their successes and failures on the job. To address this, MUSE provides an experience-driven, self-evolving system.
Here is a short summary of its key components and findings:
- The "Plan-Execute-Reflect-Memorize" Loop: MUSE operates through a continuous learning cycle. A Planning-Execution (PE) Agent breaks down tasks and attempts to execute them using a minimal set of basic tools. Afterward, a Reflect Agent autonomously evaluates the attempt, diagnoses what succeeded or failed, and distills the raw action sequences into structured, reusable knowledge.
- Hierarchical Memory Module: The framework centers around a memory system divided into three levels: Strategic Memory (high-level problem-solving guidance), Procedural Memory (standard operating procedures for tool sequences), and Tool Memory (instructions for individual tools). Because this memory is stored in natural language, the accumulated experience is LLM-agnostic and can be seamlessly transferred to different models.
- State-of-the-Art Results: The researchers evaluated MUSE on the TheAgentCompany (TAC) benchmark, which simulates a high-fidelity corporate environment. Using only a lightweight Gemini-2.5 Flash model, MUSE achieved a new State-of-the-Art (SOTA) score of 51.78%, outperforming the previous SOTA by nearly 20%.
- Generalization and Self-Evolution: The experiments demonstrated that as MUSE operates, it effectively accumulates experience to improve its task completion capabilities over time. Furthermore, this acquired knowledge showed strong zero-shot generalization, significantly boosting the agent's performance even when tackling entirely new, highly difficult tasks.
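The loop and memory hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class and function names are hypothetical, and the `execute` and `reflect` callables stand in for the PE Agent and Reflect Agent LLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class HierarchicalMemory:
    """Three-tier natural-language memory (names mirror the paper's tiers)."""
    strategic: list = field(default_factory=list)   # high-level problem-solving guidance
    procedural: list = field(default_factory=list)  # standard operating procedures
    tool: dict = field(default_factory=dict)        # per-tool instructions

    def retrieve(self, task: str) -> str:
        # A real system would retrieve semantically relevant entries;
        # here we simply concatenate everything for illustration.
        notes = self.strategic + self.procedural + list(self.tool.values())
        return "\n".join(n for n in notes if n)

def plan_execute_reflect_memorize(task, memory, execute, reflect):
    """One pass of the Plan-Execute-Reflect-Memorize cycle (sketch).

    execute(task, context) -> (success: bool, trajectory: list)
    reflect(task, trajectory, success) -> dict of new memory entries
    """
    context = memory.retrieve(task)               # condition the plan on experience
    success, trajectory = execute(task, context)  # PE Agent attempts the task
    lessons = reflect(task, trajectory, success)  # Reflect Agent distills knowledge
    memory.strategic += lessons.get("strategic", [])
    memory.procedural += lessons.get("procedural", [])
    memory.tool.update(lessons.get("tool", {}))
    return success
```

Because each memory tier is plain natural-language text, the same `HierarchicalMemory` contents can be injected into a different LLM's prompt unchanged, which is the sense in which the accumulated experience is model-agnostic.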