Ollama's Significant Performance Boost on Mac with Apple's MLX Framework
Ollama, a popular AI inference tool, has released a major update that significantly improves its performance on Mac by leveraging Apple's MLX framework. The article highlights the speed improvements, support for NVFP4 quantization, and the implications for local AI inference as a viable alternative to cloud-based solutions.
Why it matters
This update to Ollama showcases the growing viability of local AI inference as a legitimate alternative to cloud-based solutions, reducing the need for API access and enabling more seamless development and deployment workflows.
Key Points
- Ollama 0.19 rebuilt its Mac backend on top of Apple's MLX framework, resulting in nearly 2x performance improvement
- The update enables faster response times for local AI agents like Claude Code and OpenCode, reducing the need for cloud API access
- Ollama now supports NVFP4 quantization, allowing local models to match the behavior of production cloud endpoints
- Caching improvements help maintain context and reduce processing overhead for agentic workflows
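The caching point above refers to reusing work already done on a conversation's shared prefix. The article doesn't detail Ollama's internals, so the following is only a minimal sketch of the general prompt-prefix caching idea: an agent loop resends a growing conversation on each tool call, and caching the already-processed prefix means only the new tokens need a fresh prefill. The `PrefixCache` class and its simulated "state" are illustrative, not Ollama APIs.

```python
# Hedged sketch of prompt-prefix caching (not Ollama's actual implementation).
# Agentic loops resend a growing conversation; caching the processed prefix
# avoids re-prefilling tokens the model has already seen.

class PrefixCache:
    def __init__(self):
        self._cache = {}  # tuple of tokens -> simulated model state

    def process(self, tokens):
        """Return (new_tokens_processed, state), reusing the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            prefix = tuple(tokens[:end])
            if prefix in self._cache:
                state, new = self._cache[prefix], tokens[end:]
                break
        else:
            state, new = (), tokens  # cold start: nothing cached yet
        state = state + tuple(new)   # stand-in for actually prefilling `new`
        self._cache[tuple(tokens)] = state
        return len(new), state

cache = PrefixCache()
first, _ = cache.process([1, 2, 3, 4])          # cold: all 4 tokens prefilled
second, _ = cache.process([1, 2, 3, 4, 5, 6])   # warm: only the 2 new tokens
```

In a real inference server the cached state would be the transformer KV cache rather than a token tuple, but the lookup-longest-prefix pattern is the same.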
Details
Ollama, a popular AI inference tool, has released version 0.19 that significantly boosts its performance on Mac by leveraging Apple's MLX framework. The MLX framework is designed to take full advantage of Apple's unified memory architecture, where the CPU and GPU share the same memory pool, reducing the overhead of data copying. This allows Ollama to achieve remarkable speed improvements, with 1,851 tokens per second on prefill and 134 tokens per second on decode, roughly twice as fast as the previous version on the same hardware.

The update also enables Ollama to tap into the GPU Neural Accelerators on M5 chips, further widening the performance gap compared to older silicon. Additionally, Ollama now supports NVFP4 quantization, a 4-bit floating-point format used by many cloud inference providers, ensuring that local models behave the same as their production counterparts.

The article also highlights Ollama's caching improvements, which help maintain context and reduce processing overhead for agentic workflows involving multiple tool calls.
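To make the NVFP4 mention concrete: 4-bit floating-point formats like NVFP4 represent each weight with a tiny set of magnitudes and recover precision with a per-block scale factor. The sketch below is a simplified illustration of that block-scaled FP4 idea, not Ollama's or NVIDIA's actual kernel; the block size, scale encoding, and helper names are assumptions made for clarity.

```python
# Simplified sketch of block-scaled FP4 quantization (illustrative only).
# Each value is rounded to the nearest magnitude representable by a 4-bit
# E2M1 float, with one shared scale per block of values.

# Magnitudes an FP4 E2M1 value can represent (sign is handled separately).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block: return (scale, signed FP4 codes)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        mag = min(FP4_E2M1, key=lambda m: abs(abs(x) / scale - m))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.1, -0.8, 2.4, 0.0, 1.2, -3.0, 0.45, 0.9]
scale, codes = quantize_block(block)
approx = dequantize_block(scale, codes)
```

Running the same quantization locally and in the cloud is what makes local model behavior match a production endpoint: both sides see identical rounded weights, so outputs diverge far less than they would across mismatched quantization schemes.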