OpenAI Models Benchmarked on Fresh SWE GitHub PR Tasks

Researchers benchmarked 34 AI models, including OpenAI's GPT-5.2, on 47 real-world GitHub PR tasks from November 2025. GPT-5.2 matched the performance of Claude Code while being about 2.7x cheaper per problem.

💡 Why it matters

Because the tasks postdate the models' training data, this benchmark gives a contamination-resistant view of how current AI models perform, and what they cost, on real-world software engineering work, which is what teams weighing adoption actually need to know.

Key Points

  • GPT-5.2 (medium) performed on par with Claude Code, with a 61.3% resolved rate and 74.5% pass@5 (see the pass@k sketch after this list), while being 2.7x cheaper per problem ($0.47 vs $1.29)
  • GPT-5 (2025-08-07, medium) and GPT-5.1 Codex Max formed a strong mid-tier, with roughly a 58% resolved rate and 72% pass@5 at about $0.50 per problem
  • The benchmark also included Claude Code, Claude 4.5 Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2
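For context on the pass@5 numbers: pass@k is conventionally computed with the unbiased estimator from OpenAI's HumanEval paper. Here is a minimal sketch, assuming SWE-rebench uses that standard definition (the article does not spell this out):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn from n attempts is correct, given
    that c of the n attempts passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts on a task, 3 of which resolved it.
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```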

Details

Researchers from Nebius benchmarked 34 AI models on 47 real-world GitHub PR tasks from November 2025, drawn from the SWE-rebench leaderboard dataset. Because the tasks are fresh, they postdate the models' training data and so avoid training-set contamination. The OpenAI models evaluated were GPT-5.2 (medium), GPT-5 (2025-08-07, medium), and GPT-5.1 Codex Max. GPT-5.2 (medium) performed on par with Claude Code, posting a 61.3% resolved rate and 74.5% pass@5 while costing about 2.7x less per problem ($0.47 vs $1.29) and using roughly 2.2x fewer tokens. GPT-5 and GPT-5.1 Codex Max formed a strong mid-tier, at around a 58% resolved rate and 72% pass@5 for roughly $0.50 per problem. The benchmark also covered Claude Sonnet and Opus, Gemini 3 Pro, DeepSeek v3.2, and Devstral 2.
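As a rough back-of-envelope on the cost claim, using only the figures quoted above: the per-resolved-task framing, and the assumption that Claude Code's resolved rate equals GPT-5.2's ("on par"), are illustrative additions, not from the article.

```python
# Back-of-envelope sketch; dollar figures and resolved rate from the article.
models = {
    "GPT-5.2 (medium)": {"cost_per_problem": 0.47, "resolved_rate": 0.613},
    "Claude Code":      {"cost_per_problem": 1.29, "resolved_rate": 0.613},  # assumed "on par"
}
for name, stats in models.items():
    per_resolved = stats["cost_per_problem"] / stats["resolved_rate"]
    print(f"{name}: ${per_resolved:.2f} per resolved task")
# GPT-5.2 (medium): $0.77 per resolved task
# Claude Code: $2.10 per resolved task  -> ~2.7x, matching the headline ratio
```

With equal resolved rates, the 2.7x per-problem cost ratio carries straight through to cost per resolved task.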
