Dev.to Machine Learning2h ago|Research & PapersProducts & Services

AI Research Monthly: Feb-Mar 2026 — 21 Findings With Hard Data

A comprehensive review of the latest AI research benchmarks and evaluations, revealing challenges with existing coding tests and the rapid progress of AI in math and reasoning tasks.

💡

Why it matters

These benchmark results are crucial for accurately evaluating the current state of AI and guiding future research and development priorities.

Key Points

  • 1Audit found 59.4% of the hardest problems in the popular SWE-bench coding test had flawed test cases
  • 2Top AI models scored only 0.9 points apart on SWE-bench, indicating benchmark contamination
  • 3New SWE-bench Pro test shows real scores drop from 80% to 45-57%
  • 4AI that scores 74% on bug fixes scores only 11% on end-to-end feature development tasks
  • 5AI scores on 'Humanity's Last Exam' jumped from single digits to 37% in one year

Details

The article examines several recent AI research benchmarks that reveal significant issues with existing coding tests and rapid progress in math/reasoning tasks. An audit of the SWE-bench coding test found over 59% of the hardest problems had flawed test cases, allowing AI models to achieve inflated 80% scores by memorizing solutions. A new, harder 'SWE-bench Pro' test shows real scores drop to 45-57%. Another benchmark, FeatureBench, shows a 63-point gap between AI performance on bug fixes vs. end-to-end feature development. Meanwhile, AI scores on the extremely challenging 'Humanity's Last Exam' jumped from single digits to 37% in just one year, though still far behind human experts at 90%. These findings highlight the need for more rigorous, anti-contamination benchmark design as AI capabilities rapidly advance.

Like
Save
Read original
Cached
Comments
?

No comments yet

Be the first to comment

AI Curator - Daily AI News Curation

AI Curator

Your AI news assistant

Ask me anything about AI

I can help you understand AI news, trends, and technologies