AI Research Monthly: Feb-Mar 2026 — 21 Findings With Hard Data
A comprehensive review of the latest AI research benchmarks and evaluations, revealing challenges with existing coding tests and the rapid progress of AI in math and reasoning tasks.
Why it matters
These benchmark results are crucial for accurately evaluating the current state of AI and guiding future research and development priorities.
Key Points
- Audit found 59.4% of the hardest problems in the popular SWE-bench coding test had flawed test cases
- Top AI models scored only 0.9 points apart on SWE-bench, indicating benchmark contamination
- New SWE-bench Pro test shows real scores drop from 80% to 45-57%
- AI that scores 74% on bug fixes scores only 11% on end-to-end feature development tasks
- AI scores on 'Humanity's Last Exam' jumped from single digits to 37% in one year
Details
The article reviews recent AI benchmark results that expose weaknesses in existing coding tests alongside rapid progress on math and reasoning tasks. An audit of the SWE-bench coding benchmark found that 59.4% of its hardest problems had flawed test cases, letting models post inflated scores of around 80%, in part by memorizing solutions. A new, harder SWE-bench Pro test drops those scores to 45-57%. Another benchmark, FeatureBench, shows a 63-point gap between AI performance on bug fixes (74%) and end-to-end feature development (11%).

Meanwhile, AI scores on the extremely challenging 'Humanity's Last Exam' jumped from single digits to 37% in just one year, though still far behind human experts at roughly 90%. Together, these findings underscore the need for more rigorous, contamination-resistant benchmark design as AI capabilities rapidly advance.
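To make the "flawed test cases" finding concrete: a code-repair task is only informative if its tests actually fail before the fix is applied. Below is a minimal, illustrative Python sketch of that sanity check. It is not the auditors' actual tooling; the names (`TaskSpec`, `tests_pass`, `is_weak_task`) are hypothetical, and it assumes a pytest-based repository checked out at the buggy commit.

```python
"""Illustrative sketch: flag benchmark tasks whose 'fail-to-pass' tests
already pass without any fix applied (one kind of flawed test case)."""
import subprocess
from dataclasses import dataclass


@dataclass
class TaskSpec:
    repo_dir: str            # checkout of the repo at the buggy commit
    fail_to_pass: list[str]  # pytest node IDs the gold patch is supposed to fix


def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given pytest node IDs; return True only if all of them pass."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", *test_ids],
        cwd=repo_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def is_weak_task(task: TaskSpec) -> bool:
    """A task is 'weak' if its fail-to-pass tests succeed with no patch applied."""
    return tests_pass(task.repo_dir, task.fail_to_pass)
```

A check like this catches only one failure mode; tests can also be too narrow, flaky, or leak the expected answer, which is why harder follow-up suites such as SWE-bench Pro re-verify tasks rather than rely on a single filter.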