The paper introduces LongDA, a novel benchmark designed to evaluate Large Language Model (LLM) agents in documentation-intensive analytical workflows. Unlike previous benchmarks that often assume clean, well-specified inputs, LongDA reflects real-world settings where the primary bottleneck is navigating long, heterogeneous documentation to understand complex data structures.
Key aspects of the research include:
- The Benchmark: LongDA contains 505 analytical queries extracted from expert-written publications across 17 U.S. national surveys. To solve these queries, agents must retrieve and integrate information from unstructured documentation—such as codebooks and methodological reports—that averages 263,000 tokens in length.
- The Framework: The authors developed LongTA, a lightweight, tool-augmented baseline framework. It employs a ReAct-style loop in which the agent interleaves document navigation (via specialized search and retrieval tools) with Python code execution for statistical computation; a minimal illustrative sketch of such a loop appears after this list.
- Experimental Results: Evaluating a range of proprietary and open-source models, including GPT-5 and DeepSeek-V3.2, revealed substantial performance gaps. Even the strongest model, GPT-5 (High), achieved only a 68.91% match rate, indicating significant room for improvement.
- Key Findings: The study identifies information retrieval and strategic tool use as the primary bottlenecks in these workflows, rather than pure logical reasoning. Performance was also negatively affected by longer contexts and more complex answer structures, such as lists versus single numerical values.
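To make the interleaving described above concrete, here is a minimal sketch of a ReAct-style loop in Python. The tool names (`search_docs`, `read_chunk`, `run_python`), the in-memory corpus representation, and the action dispatch are illustrative assumptions made for this summary, not LongTA's actual API.

```python
import io
import contextlib


def search_docs(query: str, corpus: dict[str, str]) -> list[str]:
    """Toy keyword search: return names of documents that mention the query."""
    return [name for name, text in corpus.items() if query.lower() in text.lower()]


def read_chunk(corpus: dict[str, str], name: str, start: int = 0, size: int = 2000) -> str:
    """Return a bounded slice of one document so each observation stays small."""
    return corpus.get(name, "")[start:start + size]


def run_python(code: str) -> str:
    """Execute agent-written analysis code and capture its stdout (unsandboxed toy)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # a real framework would sandbox this call
    return buffer.getvalue()


def react_loop(llm, question: str, corpus: dict[str, str], max_steps: int = 10) -> str:
    """Interleave reasoning, document navigation, and code execution until an answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Assumed interface: the LLM returns a thought plus one (action, argument) pair,
        # e.g. ("searched the codebook for the income variable", "search", "income variable").
        thought, action, arg = llm(transcript)
        transcript += f"Thought: {thought}\nAction: {action}({arg!r})\n"
        if action == "search":
            observation = search_docs(arg, corpus)
        elif action == "read":
            observation = read_chunk(corpus, arg)
        elif action == "python":
            observation = run_python(arg)
        elif action == "finish":
            return arg  # final answer
        else:
            observation = f"Unknown action: {action}"
        transcript += f"Observation: {observation}\n"
    return "No answer within step budget"
```

Bounding each observation (short document slices, captured stdout) reflects the setting the paper describes, where the documentation averages around 263,000 tokens and cannot simply be placed in the model's context.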
Ultimately, the authors position LongDA as a challenging testbed to drive the development of more reliable and autonomous data analysis agents for high-stakes, real-world settings.