OpenAI Models Benchmarked on Fresh SWE GitHub PR Tasks

Researchers benchmarked 34 AI models, including OpenAI's GPT-5.2, on 47 real-world GitHub PR tasks from November 2025. GPT-5.2 matched the performance of Claude Code while being about 2.7x cheaper per problem.

💡 Why it matters

Because the tasks postdate the models' training data, this benchmark gives a contamination-resistant view of how current AI models perform, and what they cost, on real-world software engineering work, which is what teams weighing adoption actually need to know.

Key Points

  • GPT-5.2 (medium) performed on par with Claude Code, with a 61.3% resolved rate and 74.5% pass@5 (see the pass@k sketch after this list), while being 2.7x cheaper per problem ($0.47 vs $1.29)
  • GPT-5 (2025-08-07, medium) and GPT-5.1 Codex Max formed a strong mid-tier, with roughly a 58% resolved rate and 72% pass@5 at about $0.50 per problem
  • The benchmark also included Claude Code, Claude 4.5 Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2
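For context on the pass@5 numbers: pass@k is conventionally computed with the unbiased estimator from OpenAI's HumanEval paper. Here is a minimal sketch, assuming SWE-rebench uses that standard definition (the article does not spell this out):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n attempts is correct, given
    that c of the n attempts passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts on a task, 3 of which resolved it.
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```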

Details

Researchers from Nebius benchmarked 34 AI models on 47 real-world GitHub PR tasks from November 2025, drawn from the SWE-rebench leaderboard dataset. Because the tasks are fresh, they postdate the models' training data and so avoid training-set contamination. The OpenAI models evaluated were GPT-5.2 (medium), GPT-5 (2025-08-07, medium), and GPT-5.1 Codex Max. GPT-5.2 (medium) performed on par with Claude Code, posting a 61.3% resolved rate and 74.5% pass@5 while costing about 2.7x less per problem ($0.47 vs $1.29) and using roughly 2.2x fewer tokens. GPT-5 and GPT-5.1 Codex Max formed a strong mid-tier, at around a 58% resolved rate and 72% pass@5 for roughly $0.50 per problem. The benchmark also covered Claude Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2.
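As a rough back-of-envelope on the cost claim, using only the figures quoted above: the per-resolved-task framing, and the assumption that Claude Code's resolved rate equals GPT-5.2's ("on par"), are illustrative additions, not from the article.

```python
# Back-of-envelope sketch; dollar figures and resolved rate from the article.
models = {
    "GPT-5.2 (medium)": {"cost_per_problem": 0.47, "resolved_rate": 0.613},
    "Claude Code":      {"cost_per_problem": 1.29, "resolved_rate": 0.613},  # assumed "on par"
}
for name, stats in models.items():
    per_resolved = stats["cost_per_problem"] / stats["resolved_rate"]
    print(f"{name}: ${per_resolved:.2f} per resolved task")
# GPT-5.2 (medium): $0.77 per resolved task
# Claude Code: $2.10 per resolved task  -> ~2.7x, matching the headline ratio
```

With equal resolved rates, the 2.7x per-problem cost ratio carries straight through to cost per resolved task.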
