Dev.to Machine Learning · 2h ago | Business & Industry · Products & Services

Ollama's Significant Performance Boost on Mac with Apple's MLX Framework

Ollama, a popular AI inference tool, has released a major update that significantly improves its performance on Mac by leveraging Apple's MLX framework. The article highlights the speed improvements, support for NVFP4 quantization, and the implications for local AI inference becoming a viable alternative to cloud-based solutions.

💡 Why it matters

This update to Ollama shows local AI inference maturing into a legitimate alternative to cloud-based solutions, reducing dependence on cloud APIs and enabling faster, fully offline development and deployment workflows.

Key Points

  1. Ollama 0.19 rebuilt its Mac backend on top of Apple's MLX framework, resulting in nearly 2x performance improvement
  2. The update enables faster response times for local AI agents like Claude Code and OpenCode, reducing the need for cloud API access
  3. Ollama now supports NVFP4 quantization, allowing local models to match the behavior of production cloud endpoints
  4. Caching improvements help maintain context and reduce processing overhead for agentic workflows
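The caching point above can be illustrated with a toy prompt-prefix cache. This is a hypothetical sketch of the general idea, not Ollama's actual implementation: when an agent re-sends a conversation whose prefix (system prompt, earlier tool calls) is unchanged, only the new tokens need prefill work. The `PrefixCache` class and token lists below are invented for illustration.

```python
# Toy illustration of prompt-prefix caching for agentic workflows.
# Hypothetical sketch -- not Ollama's real KV cache, just the idea that
# an unchanged prefix (system prompt, prior turns) skips reprocessing.

class PrefixCache:
    def __init__(self):
        self.cached_tokens = []  # tokens whose state we pretend to keep

    def process(self, tokens):
        """Return how many tokens actually need prefill work."""
        # Length of the shared prefix between the cache and the new prompt.
        shared = 0
        for a, b in zip(self.cached_tokens, tokens):
            if a != b:
                break
            shared += 1
        new_work = len(tokens) - shared
        self.cached_tokens = list(tokens)
        return new_work

cache = PrefixCache()
system = ["sys"] * 500                      # long system prompt
turn1 = system + ["user", "q1"]             # first agent turn
turn2 = turn1 + ["tool", "result", "q2"]    # follow-up after a tool call

first = cache.process(turn1)   # cold start: all 502 tokens prefilled
second = cache.process(turn2)  # warm: only the 3 appended tokens
```

With a warm cache, each subsequent tool-call round pays only for its new tokens, which is why caching matters so much for multi-step agent loops.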

Details

Ollama, a popular AI inference tool, has released version 0.19, which significantly boosts its performance on Mac by leveraging Apple's MLX framework. MLX is designed to take full advantage of Apple's unified memory architecture, where the CPU and GPU share the same memory pool, reducing the overhead of data copying. This allows Ollama to achieve remarkable speed improvements: 1,851 tokens per second on prefill and 134 tokens per second on decode, roughly twice as fast as the previous version on the same hardware.

The update also enables Ollama to tap into the GPU Neural Accelerators on M5 chips, further widening the performance gap compared to older silicon. Additionally, Ollama now supports NVFP4 quantization, a 4-bit floating-point format used by many cloud inference providers, ensuring that local models behave the same as their production counterparts.

The article also highlights Ollama's caching improvements, which help maintain context and reduce processing overhead for agentic workflows involving multiple tool calls.
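To make the NVFP4 point concrete, here is a simplified sketch of 4-bit floating-point quantization. NVFP4 is based on the E2M1 value grid with per-block scaling; this toy version is an assumption-laden illustration (real NVFP4 stores scales in FP8 and packs two 4-bit codes per byte), and the `quantize_block` helper is invented for this example.

```python
# Simplified sketch of NVFP4-style 4-bit quantization: snap each value
# to the nearest point on the E2M1 grid after per-block scaling.
# Hypothetical illustration only -- real NVFP4 uses FP8 block scales
# and bit-packed storage.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(values):
    """Quantize-then-dequantize a block of floats; return (values, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # map the block's largest magnitude onto E2M1's max
    out = []
    for v in values:
        mag = abs(v) / scale
        q = min(E2M1, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(q * scale * (1.0 if v >= 0 else -1.0))
    return out, scale

weights = [0.01, -0.6, 0.33, 1.2]
deq, scale = quantize_block(weights)
```

The coarse grid is why matching the quantization format matters: a model quantized to NVFP4 locally rounds weights to the same values a cloud endpoint serving NVFP4 would, so outputs line up.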

