Description

A 30-task benchmark for evaluating the long-horizon planning capabilities of 16 AI models.