OpenAI Models Benchmarked on Fresh SWE GitHub PR Tasks
Researchers benchmarked 34 AI models, including OpenAI's GPT-5.2, on 47 real-world GitHub PR tasks from November 2025. GPT-5.2 matched the performance of Claude Code while being about 2.7x cheaper per problem.
Why it matters
Because the tasks are fresh and avoid training-set contamination, the benchmark gives a realistic comparison of how current AI models perform on real-world software engineering work, and at what cost per solved problem, both of which matter for industry adoption.
Key Points
- GPT-5.2 (medium) performed on par with Claude Code, with a 61.3% resolved rate and 74.5% pass@5, while being 2.7x cheaper per problem ($0.47 vs $1.29)
- GPT-5 (2025-08-07, medium) and GPT-5.1 Codex Max formed a strong mid-tier, with roughly a 58% resolved rate, ~72% pass@5, and about $0.50 per problem
- The benchmark also covered models such as Claude Code, Claude 4.5 Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2
Details
Researchers from Nebius benchmarked 34 AI models on 47 real-world GitHub PR tasks from November 2025, drawn from the SWE-rebench leaderboard dataset. Because the tasks are fresh, they avoid training-set contamination. The OpenAI models evaluated were GPT-5.2 (medium), GPT-5 (2025-08-07, medium), and GPT-5.1 Codex Max. GPT-5.2 (medium) performed on par with Claude Code, with a 61.3% resolved rate and 74.5% pass@5, while being about 2.7x cheaper per problem ($0.47 vs $1.29) and using roughly 2.2x fewer tokens. GPT-5 and GPT-5.1 Codex Max formed a strong mid-tier, with around a 58% resolved rate, roughly 72% pass@5, and a cost of around $0.50 per problem. The benchmark also included Claude 4.5 Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2.
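The article does not define its metrics, but conventionally the resolved rate is the share of tasks a model solves, while pass@5 counts a task as solved if any of several sampled attempts succeeds. The sketch below is illustrative only: it uses the standard unbiased pass@k estimator from Chen et al. (2021), which may differ from the leaderboard's exact methodology, plus the per-problem costs quoted above; the function name and the per-task attempt counts are hypothetical, not benchmark data.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n attempts of which c passed,
    resolves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task with 2 passing attempts out of 5 (not benchmark data):
print(pass_at_k(n=5, c=2, k=1))   # 0.4 -> contribution to a pass@1-style average
print(pass_at_k(n=5, c=2, k=5))   # 1.0 -> contribution to a pass@5 average

# Cost ratio quoted in the article: Claude Code vs GPT-5.2 (medium) per problem
print(round(1.29 / 0.47, 1))      # 2.7
```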