This September 2025 paper introduces Mini-o3, a Vision-Language Model (VLM) designed to overcome the limitations of existing VLMs on complex visual search tasks that require multi-turn reasoning and trial-and-error exploration. The authors propose a three-component training recipe: a Visual Probe Dataset of challenging, high-resolution images; a pipeline for synthesizing diverse multi-turn trajectories for supervised fine-tuning; and an over-turn masking technique for reinforcement learning. The masking avoids penalizing long trajectories that run out of turns before reaching an answer, encouraging deeper exploration without increasing training time. Mini-o3 achieves state-of-the-art results on several visual search benchmarks, demonstrating adaptive visual understanding through iterative cycles of observation, thought, and action.
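The over-turn masking idea can be illustrated with a minimal sketch. The function below is a hypothetical simplification (the names `compute_masked_advantages`, `rewards`, and `over_turn_flags` are assumptions, not the paper's implementation): trajectories that hit the turn limit receive a loss mask of zero, so they contribute no negative gradient, while completed trajectories are trained on normally.

```python
def compute_masked_advantages(rewards, over_turn_flags):
    """Sketch of over-turn masking for RL advantages (hypothetical).

    Trajectories that exceeded the turn budget (over_turn_flags True)
    are masked out of the loss rather than penalized, so long but
    unfinished reasoning paths do not discourage exploration.
    """
    # Baseline computed from completed trajectories only (an assumed choice).
    completed = [r for r, over in zip(rewards, over_turn_flags) if not over]
    baseline = sum(completed) / len(completed) if completed else 0.0

    advantages, masks = [], []
    for r, over in zip(rewards, over_turn_flags):
        advantages.append(r - baseline)
        # Mask of 0.0 removes the over-turn trajectory from the policy loss.
        masks.append(0.0 if over else 1.0)
    return advantages, masks

# Example: two completed rollouts (reward 1 and 0) and one that ran out of turns.
adv, mask = compute_masked_advantages([1.0, 0.0, 0.0], [False, False, True])
```

In a policy-gradient update, the per-trajectory loss would then be multiplied by `mask`, so the over-turn rollout neither helps nor hurts the policy, instead of being treated as a failure.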
Source:
https://arxiv.org/pdf/2509.07969