Devstral 2 vs Sonnet 4.5 (Claude Code) on SWE-bench
The author compared Devstral 2 (Mistral's Vibe) against Sonnet 4.5 (Claude Code) on the SWE-bench-verified-mini dataset and found their accuracy within statistical error of each other.
Why it matters
This head-to-head comparison on a real-world benchmark suggests that an open-weight model can now match a leading commercial model on software engineering tasks.
Key Points
- Devstral 2 (Mistral's Vibe) matched Anthropic's best model (Claude Code) in the author's test
- Devstral 2 was faster than Claude Code, with a mean runtime of 296s vs 357s
- Both models showed high variance, with about 40% of test cases having inconsistent outcomes across runs
Details
The author ran Devstral 2 and Claude Code (Sonnet 4.5) on the SWE-bench-verified-mini dataset of 45 real GitHub issues, with 10 attempts per issue per model (900 runs in total). Devstral 2 achieved 37.6% accuracy versus 39.8% for Claude Code, a gap within statistical error. This is notable because Devstral 2 is an open-weight model that the author can run on their own hardware, yet it matched the performance of Anthropic's recent model. Both models also exhibited high variance: about 40% of test cases had inconsistent outcomes across repeated runs, while other cases were solved consistently, 10 out of 10 times.
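One way to check whether a roughly 2-point accuracy gap on 45 issues is within noise is a paired bootstrap over cases. The sketch below uses simulated per-run outcomes (not the author's raw data) with a mix of always-solved, never-solved, and flaky cases, mirroring the variance the author describes; all names and probabilities here are illustrative assumptions.

```python
import random

# Illustrative setup: 45 cases x 10 attempts per model, as in the benchmark.
# The outcome data is simulated, not the author's actual results.
random.seed(0)
N_CASES, N_RUNS = 45, 10

def simulate(flaky_rate):
    # Each case draws its own solve probability: always solved, never
    # solved, or "flaky" (solved some fraction of the time).
    runs = []
    for _ in range(N_CASES):
        p = random.choice([0.0, 1.0, flaky_rate])
        runs.append([1 if random.random() < p else 0 for _ in range(N_RUNS)])
    return runs

def accuracy(runs):
    # Mean pass rate over all case/attempt pairs.
    return sum(map(sum, runs)) / (N_CASES * N_RUNS)

def flaky_fraction(runs):
    # A case is inconsistent if it neither passed nor failed in all runs.
    return sum(0 < sum(r) < N_RUNS for r in runs) / N_CASES

def bootstrap_gap_ci(runs_a, runs_b, iters=2000):
    # Paired bootstrap: resample the 45 issues with replacement and
    # recompute the accuracy gap each time; return a 95% interval.
    gaps = []
    for _ in range(iters):
        idx = [random.randrange(N_CASES) for _ in range(N_CASES)]
        acc_a = sum(sum(runs_a[i]) for i in idx) / (N_CASES * N_RUNS)
        acc_b = sum(sum(runs_b[i]) for i in idx) / (N_CASES * N_RUNS)
        gaps.append(acc_b - acc_a)
    gaps.sort()
    return gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]

model_a = simulate(0.4)   # stand-in for Devstral 2
model_b = simulate(0.45)  # stand-in for Claude Code
lo, hi = bootstrap_gap_ci(model_a, model_b)
print(f"acc A={accuracy(model_a):.3f} acc B={accuracy(model_b):.3f}")
print(f"95% CI on gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval straddles zero, the gap is consistent with run-to-run noise, which is the sense in which the author's 37.6% vs 39.8% result is "within statistical error". Resampling whole cases (rather than individual attempts) respects the fact that attempts on the same issue are correlated.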