Alibaba's SWE-CI benchmark tested 18 AI models on 100 real codebases across 233 days of maintenance. Most agents accumulate technical debt and break previously working code. Only Claude Opus keeps its zero-regression rate above 50%.