Latest METR results show Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins

The latest results from METR (Model Evaluation and Threat Research) show that the Claude Opus 4.5 large language model has a 50%-time horizon of around 4 hours and 49 minutes, described as the biggest jump in LLM capabilities ever recorded.

Why it matters

This breakthrough in Claude's long-task capabilities represents a major milestone in the advancement of large language models and their real-world applications.

Key Points

  • METR measures model success on multi-step software and reasoning tasks as a function of the time human experts take to complete them
  • Current models have an almost 100% success rate on tasks taking humans less than 4 minutes, but succeed less than 10% of the time on tasks taking more than 4 hours
  • Claude Opus 4.5 has achieved a 50%-time horizon of around 4 hours and 49 minutes, a significant improvement in long-task capabilities

Details

METR (Model Evaluation and Threat Research) evaluates large language models (LLMs) by recording the time human experts need to complete a diverse set of multi-step software and reasoning tasks, then measuring how often models succeed on those same tasks. It has found that the time taken by humans is strongly predictive of model success: current models have an almost 100% success rate on tasks taking humans less than 4 minutes, but succeed less than 10% of the time on tasks taking more than 4 hours. A model's 50%-time horizon is the task length, measured in human expert time, at which the model is expected to succeed half of the time. The latest METR results show that Claude Opus 4.5 has achieved a 50%-time horizon of around 4 hours and 49 minutes, described as the biggest jump in LLM capabilities ever recorded. This indicates a significant improvement in the model's ability to handle long, complex tasks compared to previous versions.
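
To make the idea of a 50%-time horizon concrete, here is a minimal sketch, assuming success probability follows a logistic curve in the logarithm of human task time. The function names, the slope value, and the derived intercept are hypothetical choices for illustration and are not METR's fitted parameters; only the 4 hours 49 minutes figure comes from the article.

```python
import math

# 4 hours 49 minutes expressed in minutes (the figure reported in the article).
HORIZON_MINUTES = 4 * 60 + 49  # 289


def success_probability(task_minutes, intercept, coef):
    """Logistic model: p = sigmoid(intercept + coef * log2(task_minutes))."""
    z = intercept + coef * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def horizon_from_fit(intercept, coef):
    """Task length (minutes) where the modelled success probability is 0.5.

    p = 0.5 exactly when intercept + coef * log2(t) = 0,
    so t = 2 ** (-intercept / coef).
    """
    return 2.0 ** (-intercept / coef)


# Hypothetical slope: success falls as tasks get longer, so coef is negative.
coef = -0.6
# Choose the intercept so the curve crosses 50% at the reported horizon.
intercept = -coef * math.log2(HORIZON_MINUTES)

if __name__ == "__main__":
    for minutes in (4, 60, HORIZON_MINUTES, 16 * 60):
        p = success_probability(minutes, intercept, coef)
        print(f"{minutes:>4} min task -> modelled success {p:.2f}")
    print("50%-time horizon (minutes):", round(horizon_from_fit(intercept, coef)))
```

With real data, the intercept and slope would be estimated from per-task success and failure records rather than set by hand; the horizon is then read off the fitted curve the same way.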

AI Curator - Daily AI News Curation
