Latest METR results show Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins
The latest results from METR (Model Evaluation & Threat Research) show that the Claude Opus 4.5 large language model has a 50%-time horizon of around 4 hours and 49 minutes, reported as the biggest jump in LLM capabilities recorded so far.
Why it matters
A 50%-time horizon of nearly five hours means Claude Opus 4.5 is estimated to succeed about half the time on tasks that take human experts that long, a milestone for applying large language models to long, real-world tasks.
Key Points
- METR measures model success on multi-step software and reasoning tasks as a function of the time those tasks take human experts
- In METR's original analysis, models had an almost 100% success rate on tasks taking humans less than 4 minutes, but succeeded less than 10% of the time on tasks taking more than 4 hours
- The Claude Opus 4.5 model has achieved a 50%-time horizon of around 4 hours and 49 minutes, a significant improvement in long-task capabilities
Details
METR (Model Evaluation & Threat Research) evaluates large language models (LLMs) by recording the time human experts need to complete a diverse set of multi-step software and reasoning tasks, then measuring how often a model succeeds on tasks of each length. It has found that the time taken by humans is strongly predictive of model success: in METR's original analysis, models had an almost 100% success rate on tasks taking humans less than 4 minutes, but succeeded less than 10% of the time on tasks taking more than 4 hours. A model's 50%-time horizon is the human task length at which the model is expected to succeed half the time. The latest METR results put the 50%-time horizon of Claude Opus 4.5 at around 4 hours and 49 minutes, reported as the biggest jump in LLM capabilities recorded so far. This indicates a significant improvement in the model's ability to handle long, complex tasks compared to previous versions.
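To make the idea of a 50%-time horizon concrete, here is a minimal sketch that fits a logistic curve to success rate against the logarithm of human task time, then solves for the task length at which predicted success is 50%. It is an illustration under stated assumptions, not METR's actual code or data: the task lengths, the success rates, and the function names `fit_logistic` and `time_horizon` are all hypothetical, and the synthetic success rates are generated from a curve whose true horizon is 60 minutes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=50):
    """Fit p = sigmoid(a + b * x) by Newton's method on the log-likelihood.

    xs: log2 of human task time in minutes.
    ys: observed success rate (between 0 and 1) at each task length.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient with respect to a
            g_b += (y - p) * x      # gradient with respect to b
            h_aa += w               # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def time_horizon(a, b, p=0.5):
    """Task length in minutes at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical synthetic data (NOT METR measurements): success rates drawn
# from a logistic curve with a true 50% horizon of 60 minutes.
task_minutes = [1, 4, 15, 60, 240, 960]
true_slope = -0.6  # assumed change in log-odds per doubling of task length
xs = [math.log2(t) for t in task_minutes]
ys = [sigmoid(true_slope * (x - math.log2(60))) for x in xs]

a, b = fit_logistic(xs, ys)
horizon_minutes = time_horizon(a, b)
print(f"Estimated 50%-time horizon: {horizon_minutes:.1f} minutes")
```

Because the synthetic success rates lie exactly on a logistic curve, the fit should recover a horizon very close to 60 minutes. Calling `time_horizon(a, b, p=0.8)` gives the shorter task length at which the same curve predicts 80% success, which shows how the chosen success threshold changes the reported horizon.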