MolmoWeb 4B/8B: Multimodal Web Agents Outperform Larger Models
MolmoWeb is a family of open multimodal web agents that achieve state-of-the-art results, outperforming open-weight-only models of similar scale and even larger closed frontier models such as GPT-4.
Why it matters
MolmoWeb demonstrates the potential of open multimodal models to match or exceed the performance of larger closed-source models, highlighting the importance of open AI research and development.
Key Points
- MolmoWeb agents outperform open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B
- MolmoWeb-8B surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4
- Consistent gains from test-time scaling via parallel rollouts with best-of-N selection
Details
MolmoWeb is a family of fully open multimodal web agents developed by the Allen Institute for AI. The agents build on the Molmo2 architecture, which pairs a Qwen3-8B language backbone with a SigLIP 2 vision encoder. They achieve state-of-the-art results, outperforming open-weight-only models of similar scale as well as larger closed frontier models like GPT-4. A key finding is that performance scales consistently at test time: running several rollouts in parallel and keeping the best via best-of-N selection yields significant pass@4 improvements on benchmark tasks.
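The test-time scaling idea can be sketched as follows. This is a minimal illustration, not MolmoWeb's actual implementation: `run_rollout` and `score_trajectory` are hypothetical placeholders for the agent's real web rollout and whatever verifier or reward signal is used to rank trajectories.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_rollout(task: str, seed: int) -> list[str]:
    # Placeholder for one agent trajectory on a web task; a real rollout
    # would drive a browser and record the actions taken.
    rng = random.Random(seed)
    n_steps = rng.randint(3, 8)
    return [f"{task}:step-{i}" for i in range(n_steps)]

def score_trajectory(trajectory: list[str]) -> float:
    # Placeholder scorer that prefers shorter trajectories; a real system
    # would use a learned verifier or a task-success signal instead.
    return 1.0 / len(trajectory)

def best_of_n(task: str, n: int = 4) -> list[str]:
    # Run N independent rollouts in parallel, then keep the one the
    # scorer ranks highest (best-of-N selection).
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(lambda s: run_rollout(task, s), range(n)))
    return max(rollouts, key=score_trajectory)

best = best_of_n("book-flight", n=4)
print(f"selected trajectory with {len(best)} steps")
```

With N=4 independent rollouts per task, a metric like pass@4 credits the agent whenever at least one of the four trajectories succeeds, which is why best-of-N selection translates directly into the reported gains.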