Qwen 3.6 Ollama Release, Consumer GPU Benchmarks, GGUF Quantization Fixes
This article covers the release of Qwen 3.6 models on the Ollama platform, performance optimizations for running Qwen 3.6 on consumer hardware, and a technique to enhance GGUF quantization quality.
Why it matters
These developments make high-performance open-weight models like Qwen 3.6 more accessible and practical for local deployment, furthering the growth of the self-hosted AI ecosystem.
Key Points
- Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model now available on Ollama with optimized quantization levels
- Significant performance gains achieved on consumer hardware like RTX 5070 Ti + 9800X3D using the --n-cpu-moe flag
- A solution to fix the 'ssm_conv1d tensor drift' issue in GGUF quantized models using the Wasserstein metric
Details
The Qwen 3.6 35B-A3B MoE model is now available on the Ollama platform, giving users easy access to this open-weight model with quantization levels tailored for efficient local inference on consumer hardware, especially Mac systems. The release includes iq3 (13 GB) and iq4 (18 GB) quantizations, bringing Qwen 3.6 within reach of a wider range of users.

A notable performance benchmark is also shared: the Qwen 3.6 35B-A3B model runs at 79 tokens per second on a system pairing an RTX 5070 Ti GPU with a 9800X3D CPU, with the --n-cpu-moe flag, which keeps MoE expert weights on the CPU, cited as the critical optimization.

Finally, a solution to the 'ssm_conv1d tensor drift' issue in GGUF quantized models is presented: the Wasserstein metric is used to minimize the drift between quantized and original weights, maintaining higher fidelity to the unquantized model.
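To make the drift idea concrete, here is a minimal sketch of how the Wasserstein-1 distance can score how far a tensor's value distribution moves after quantization. Everything below is illustrative: the round-to-nearest 4-bit quantizer is a simple stand-in, not the actual GGUF ssm_conv1d codepath, and the function names are hypothetical.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Round-trip a float tensor through symmetric integer quantization.

    Stand-in quantizer for illustration only; GGUF uses block-wise
    schemes that are more involved than this.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = float(np.abs(w).max()) / qmax
    if scale == 0.0:
        scale = 1.0                     # avoid division by zero for all-zero tensors
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def wasserstein_1d(u: np.ndarray, v: np.ndarray) -> float:
    """W1 distance between two equal-sized empirical distributions.

    For equal sample counts this reduces to the mean absolute
    difference of the sorted samples.
    """
    return float(np.mean(np.abs(np.sort(u.ravel()) - np.sort(v.ravel()))))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)    # toy stand-in for a conv1d weight tensor
w_q = quantize_dequantize(w, bits=4)
drift = wasserstein_1d(w, w_q)
print(f"Wasserstein drift at 4-bit: {drift:.6f}")
```

A quantization scheme tuned to minimize this distance keeps the quantized weight distribution close to the original, which is the fidelity property the fix described above is after.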