The Leaderboard Illusion
https://arxiv.org/html/2504.20879v1
Measuring progress is crucial for any scientific field, and in Artificial Intelligence, benchmarks play a vital role. Recently, Chatbot Arena has become the most influential leaderboard for ranking the capabilities of large language models (LLMs), shaping media coverage, industry decisions, and academic research. However, a new research paper titled "The Leaderboard Illusion" reveals systematic issues within Chatbot Arena that may distort the rankings and present a misleading picture of AI progress.
The paper highlights several critical findings based on an analysis of millions of Chatbot Arena battles, numerous models, and provider practices:
Undisclosed Private Testing & Selective Reporting: A core issue is an unstated policy allowing a select few providers (often large, proprietary ones) to privately test multiple versions of their models on the Arena. They can then choose to publicly release only the best-performing variant and retract or hide lower scores. The paper notes an extreme case where Meta tested 27 private variants leading up to the Llama-4 release. This "handpicking" of scores violates fair comparison principles and artificially inflates the rankings of these preferred providers.
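Why does releasing only the best of many privately tested variants inflate a score? Even if every variant has identical underlying ability, measured Arena scores carry sampling noise, so the maximum of many noisy measurements sits well above the true value. The simulation below is a minimal sketch of this selection effect with made-up numbers (a true skill of 1500 and a noise standard deviation of 20 are illustrative assumptions, not figures from the paper):

```python
import random
import statistics

random.seed(0)

def observed_best_score(n_variants, true_skill=1500.0, noise_sd=20.0):
    """Measured score of the best of n privately tested variants.

    Every variant has the SAME underlying skill; differences in measured
    score are pure sampling noise, so reporting only the max inflates
    the published result.
    """
    scores = [random.gauss(true_skill, noise_sd) for _ in range(n_variants)]
    return max(scores)

trials = 10_000
single = statistics.mean(observed_best_score(1) for _ in range(trials))
best_of_27 = statistics.mean(observed_best_score(27) for _ in range(trials))

# The best-of-27 average lands roughly two noise standard deviations
# above the honest single-submission average, despite equal true skill.
inflation = best_of_27 - single
```

The size of the gap grows with the number of private variants tested, which is why transparent limits on private testing and a ban on score retraction matter.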
Significant Data Access Disparities: Proprietary, closed models receive a disproportionately large share of user interaction data (prompts and battle outcomes) compared to open-weight and open-source models. For instance, Google and OpenAI models received an estimated 19.2% and 20.4% of all Arena data respectively, while a combined 83 open-weight models received only about 29.7%. This asymmetry gives proprietary developers significantly more valuable, free community feedback data.
Overfitting Risk Fueled by Data Access: Access to this Arena data provides substantial performance benefits specifically on the Arena benchmark. The researchers estimate that even limited Arena data can lead to relative performance gains of up to 112% on Arena-related tasks. This creates a high risk of models overfitting to the specific dynamics and user patterns of Chatbot Arena, rather than improving general capabilities applicable to real-world scenarios. Gains on the leaderboard might not translate to broader progress.
Unreliable Rankings Due to Model Deprecation: Many models, particularly open-weight and open-source ones (around 88-89% of them), are "silently deprecated" – their sampling rate is reduced to near zero without official notification. This practice, combined with the changing nature of user prompts over time (distribution shifts), means that older model scores become outdated. It violates key assumptions of the Bradley-Terry ranking model used by the Arena, reducing the reliability and stability of the leaderboard rankings over time.
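The Bradley-Terry model referenced above estimates a latent strength for each model from pairwise battle outcomes, assuming those strengths are stable and that comparisons are drawn under comparable conditions; silent deprecation and shifting prompt distributions undermine both assumptions. As background, here is a minimal sketch of a Bradley-Terry fit using the classic minorization-maximization update, on toy battle counts (the win matrix is invented for illustration, not Arena data):

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of battles model i won against model j.
    P(i beats j) is modeled as p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # MM update: wins divided by expected "exposure" to opponents
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize so strengths sum to 1
    return p

# Toy data: model 0 dominates, model 2 is weakest.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(wins)
```

If a model stops being sampled, its strength estimate is frozen at an old snapshot of user prompts while its rivals keep accumulating battles under the new distribution, so the fitted strengths no longer describe the same contest.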
Why This Matters
These findings suggest that the current state of Chatbot Arena, while built on valuable community effort, systematically favors a small group of large providers. This distorts our perception of true AI progress, creates an uneven playing field that disadvantages open research, and potentially encourages "teaching to the test" rather than genuine innovation.
Recommendations for Reform
The paper concludes by offering actionable recommendations to restore fairness, transparency, and trust in Chatbot Arena. These include prohibiting score retraction after submission, establishing transparent limits on private testing, ensuring model removals are applied equally across license types, implementing fairer sampling strategies, and providing full public transparency into all tested models, deprecations, and sampling rates.
In conclusion, "The Leaderboard Illusion" serves as a critical analysis of a highly influential AI benchmark. It calls for the AI community and benchmark organizers to address these systemic issues to ensure that leaderboards accurately reflect scientific progress and foster a truly fair and innovative ecosystem.