Description

The paper introduces LongDA, a benchmark designed to evaluate Large Language Model (LLM) agents on documentation-intensive analytical workflows. Unlike previous benchmarks, which often assume clean, well-specified inputs, LongDA reflects real-world settings where the primary bottleneck is navigating long, heterogeneous documentation to understand complex data structures.

Key aspects of the research include:

Ultimately, the authors position LongDA as a challenging testbed to drive the development of more reliable and autonomous data analysis agents for high-stakes, real-world settings.