Dev.to Machine Learning · 2h ago | Business & Industry · Products & Services

Ollama's Significant Performance Boost on Mac with Apple's MLX Framework

Ollama, a popular AI inference tool, has released a major update that significantly improves its performance on Mac by leveraging Apple's MLX framework. The article highlights the speed improvements, support for NVFP4 quantization, and the implications for local AI inference becoming a viable alternative to cloud-based solutions.

💡 Why it matters

This update to Ollama shows local AI inference maturing into a legitimate alternative to cloud-based solutions, reducing dependence on cloud APIs and enabling faster, fully offline development and deployment workflows.

Key Points

  1. Ollama 0.19 rebuilt its Mac backend on top of Apple's MLX framework, resulting in nearly 2x performance improvement
  2. The update enables faster response times for local AI agents like Claude Code and OpenCode, reducing the need for cloud API access
  3. Ollama now supports NVFP4 quantization, allowing local models to match the behavior of production cloud endpoints
  4. Caching improvements help maintain context and reduce processing overhead for agentic workflows
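The caching point above can be illustrated with a toy prompt-prefix cache. This is a hypothetical sketch of the general idea, not Ollama's actual implementation: when an agent re-sends a conversation whose prefix (system prompt, earlier tool calls) is unchanged, only the new tokens need prefill work. The `PrefixCache` class and token lists below are invented for illustration.

```python
# Toy illustration of prompt-prefix caching for agentic workflows.
# Hypothetical sketch -- not Ollama's real KV cache, just the idea that
# an unchanged prefix (system prompt, prior turns) skips reprocessing.

class PrefixCache:
    def __init__(self):
        self.cached_tokens = []  # tokens whose state we pretend to keep

    def process(self, tokens):
        """Return how many tokens actually need prefill work."""
        # Length of the shared prefix between the cache and the new prompt.
        shared = 0
        for a, b in zip(self.cached_tokens, tokens):
            if a != b:
                break
            shared += 1
        new_work = len(tokens) - shared
        self.cached_tokens = list(tokens)
        return new_work

cache = PrefixCache()
system = ["sys"] * 500                      # long system prompt
turn1 = system + ["user", "q1"]             # first agent turn
turn2 = turn1 + ["tool", "result", "q2"]    # follow-up after a tool call

first = cache.process(turn1)   # cold start: all 502 tokens prefilled
second = cache.process(turn2)  # warm: only the 3 appended tokens
```

With a warm cache, each subsequent tool-call round pays only for its new tokens, which is why caching matters so much for multi-step agent loops.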

Details

Ollama, a popular AI inference tool, has released version 0.19, which significantly boosts its performance on Mac by leveraging Apple's MLX framework. MLX is designed to take full advantage of Apple's unified memory architecture, where the CPU and GPU share the same memory pool, reducing the overhead of data copying. This allows Ollama to achieve remarkable speed improvements: 1,851 tokens per second on prefill and 134 tokens per second on decode, roughly twice as fast as the previous version on the same hardware.

The update also enables Ollama to tap into the GPU Neural Accelerators on M5 chips, further widening the performance gap compared to older silicon. Additionally, Ollama now supports NVFP4 quantization, a 4-bit floating-point format used by many cloud inference providers, ensuring that local models behave the same as their production counterparts.

The article also highlights Ollama's caching improvements, which help maintain context and reduce processing overhead for agentic workflows involving multiple tool calls.
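To make the NVFP4 point concrete, here is a simplified sketch of 4-bit floating-point quantization. NVFP4 is based on the E2M1 value grid with per-block scaling; this toy version is an assumption-laden illustration (real NVFP4 stores scales in FP8 and packs two 4-bit codes per byte), and the `quantize_block` helper is invented for this example.

```python
# Simplified sketch of NVFP4-style 4-bit quantization: snap each value
# to the nearest point on the E2M1 grid after per-block scaling.
# Hypothetical illustration only -- real NVFP4 uses FP8 block scales
# and bit-packed storage.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes

def quantize_block(values):
    """Quantize-then-dequantize a block of floats; return (values, scale)."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 6.0  # map the block's largest magnitude onto E2M1's max
    out = []
    for v in values:
        mag = abs(v) / scale
        q = min(E2M1, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(q * scale * (1.0 if v >= 0 else -1.0))
    return out, scale

weights = [0.01, -0.6, 0.33, 1.2]
deq, scale = quantize_block(weights)
```

The coarse grid is why matching the quantization format matters: a model quantized to NVFP4 locally rounds weights to the same values a cloud endpoint serving NVFP4 would, so outputs line up.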

