AI Research Monthly: Feb-Mar 2026 — 21 Findings With Hard Data
A comprehensive review of the latest AI research benchmarks and evaluations, revealing challenges with existing coding tests and the rapid progress of AI in math and reasoning tasks.
Why it matters
These benchmark results are crucial for accurately evaluating the current state of AI and guiding future research and development priorities.
Key Points
- Audit found 59.4% of the hardest problems in the popular SWE-bench coding test had flawed test cases
- Top AI models scored only 0.9 points apart on SWE-bench, indicating benchmark contamination
- New SWE-bench Pro test shows real scores drop from 80% to 45-57%
- AI that scores 74% on bug fixes scores only 11% on end-to-end feature development tasks
- AI scores on 'Humanity's Last Exam' jumped from single digits to 37% in one year
Details
The article reviews recent AI benchmark results that expose weaknesses in existing coding tests alongside rapid progress on math and reasoning tasks. An audit of the SWE-bench coding benchmark found that 59.4% of its hardest problems had flawed test cases, letting models post inflated scores of around 80%, in part by memorizing solutions. A new, harder SWE-bench Pro test drops those scores to 45-57%. Another benchmark, FeatureBench, shows a 63-point gap between AI performance on bug fixes (74%) and end-to-end feature development (11%).

Meanwhile, AI scores on the extremely challenging 'Humanity's Last Exam' jumped from single digits to 37% in just one year, though still far behind human experts at roughly 90%. Together, these findings underscore the need for more rigorous, contamination-resistant benchmark design as AI capabilities rapidly advance.
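To make the "flawed test cases" finding concrete: a code-repair task is only informative if its tests actually fail before the fix is applied. Below is a minimal, illustrative Python sketch of that sanity check. It is not the auditors' actual tooling; the names (`TaskSpec`, `tests_pass`, `is_weak_task`) are hypothetical, and it assumes a pytest-based repository checked out at the buggy commit.

```python
"""Illustrative sketch: flag benchmark tasks whose 'fail-to-pass' tests
already pass without any fix applied (one kind of flawed test case)."""
import subprocess
from dataclasses import dataclass


@dataclass
class TaskSpec:
    repo_dir: str            # checkout of the repo at the buggy commit
    fail_to_pass: list[str]  # pytest node IDs the gold patch is supposed to fix


def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given pytest node IDs; return True only if all of them pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def is_weak_task(task: TaskSpec) -> bool:
    """A task is 'weak' if its fail-to-pass tests succeed with no patch applied."""
    return tests_pass(task.repo_dir, task.fail_to_pass)
```

A check like this catches only one failure mode; tests can also be too narrow, flaky, or leak the expected answer, which is why harder follow-up suites such as SWE-bench Pro re-verify tasks rather than rely on a single filter.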