Description

The September 2025 paper introduces **LLM-Interleaved (LLM-I)**, a flexible framework for interleaved image-text generation that reframes the task as a **tool-use problem**, overcoming the "one-tool" limitation of unified models. Authored by researchers from Zhejiang University and **ByteDance BandAI**, the system uses a central Large Language Model (LLM) or Multimodal LLM (MLLM) agent to orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained with a **Reinforcement Learning (RL) framework** featuring a hybrid reward system that combines rule-based logic with LLM and MLLM evaluators. The research demonstrates that LLM-I achieves **state-of-the-art performance** across four benchmarks by moving from an "omniscient solver" to a "proficient tool-user" paradigm, allowing for factually grounded and programmatically precise visual outputs.
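
To make the tool-use framing concrete, here is a minimal Python sketch of how such an orchestration loop might look. The tool-tag format, the function names (`image_search`, `diffusion_generate`, `run_plotting_code`, `edit_image`), and the simple gated reward are illustrative assumptions made for this summary, not the paper's actual interfaces.

```python
# Illustrative sketch of the LLM-I idea: a central agent emits tool calls
# inline with its text, and a dispatcher replaces each call with the image
# the corresponding tool produces. All names and the tag format below are
# assumptions made for this example, not the paper's actual API.
import re
from typing import Callable, Dict

# Hypothetical tool backends; each returns a reference (path/URL) to an image.
def image_search(query: str) -> str:
    return f"search://{query}"          # factually grounded, retrieved image

def diffusion_generate(prompt: str) -> str:
    return f"diffusion://{prompt}"      # creative, synthesized image

def run_plotting_code(code: str) -> str:
    return "chart://rendered.png"       # programmatically precise figure

def edit_image(instruction: str) -> str:
    return f"edit://{instruction}"      # targeted modification of an image

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": image_search,
    "generate": diffusion_generate,
    "code": run_plotting_code,
    "edit": edit_image,
}

TOOL_TAG = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)

def render_interleaved(agent_output: str) -> str:
    """Replace every tool call the agent emitted with the image it yields,
    producing the final interleaved image-text document."""
    def dispatch(match: re.Match) -> str:
        name, argument = match.group(1), match.group(2).strip()
        image_ref = TOOLS[name](argument)
        return f"![{name} result]({image_ref})"
    return TOOL_TAG.sub(dispatch, agent_output)

def hybrid_reward(rule_score: float, llm_score: float, mllm_score: float) -> float:
    """Toy stand-in for the hybrid reward: rule-based checks (e.g. valid tool
    syntax, expected image count) gate the judge scores from LLM and MLLM
    evaluators. The weighting is arbitrary, chosen only for illustration."""
    return rule_score * (0.5 * llm_score + 0.5 * mllm_score)

if __name__ == "__main__":
    draft = ('The 2024 sales trend is shown below. '
             '<tool name="code">plt.plot(sales_2024)</tool> '
             'For context, here is the flagship store. '
             '<tool name="search">flagship store exterior</tool>')
    print(render_interleaved(draft))
```

In the actual framework the tool calls are produced autoregressively by the RL-trained agent, and the reward shapes when and how it invokes each tool; this sketch shows only the dispatch side of that loop.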

Source:

https://arxiv.org/pdf/2509.13642