Qwen 3.6 Ollama Release, Consumer GPU Benchmarks, GGUF Quantization Fixes

This article covers the release of Qwen 3.6 models on the Ollama platform, performance optimizations for running Qwen 3.6 on consumer hardware, and a technique to enhance GGUF quantization quality.

đź’ˇ

Why it matters

These developments make high-performance open-weight models like Qwen 3.6 more accessible and practical for local deployment, furthering the growth of the self-hosted AI ecosystem.

Key Points

  1. Qwen 3.6 35B-A3B Mixture-of-Experts (MoE) model now available on Ollama with optimized quantization levels
  2. Significant performance gains on consumer hardware (RTX 5070 Ti + 9800X3D) using the `--n-cpu-moe` flag
  3. A fix for the "ssm_conv1d tensor drift" issue in GGUF quantized models using the Wasserstein metric
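The `--n-cpu-moe` optimization from point 2 is a llama.cpp server flag that keeps the MoE expert tensors of the first N layers in system RAM while the rest of the model runs on the GPU. A minimal sketch of such an invocation follows; the model filename, layer count, and context size are illustrative placeholders, not values from the article:

```shell
# Hypothetical llama.cpp invocation (filename and numbers are placeholders).
# -ngl 99 offloads all layers to the GPU; --n-cpu-moe 24 keeps the MoE
# expert weights of the first 24 layers on the CPU, freeing VRAM for the
# attention and dense tensors that benefit most from GPU placement.
./llama-server \
  -m Qwen3.6-35B-A3B-iq4.gguf \
  -ngl 99 \
  --n-cpu-moe 24 \
  -c 8192
```

Because only a small subset of experts is active per token in an A3B-style model, CPU-resident expert weights cost relatively little throughput while substantially reducing VRAM pressure on a 16 GB-class card.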

Details

The article announces the release of the Qwen 3.6 35B-A3B MoE model on the Ollama platform, with quantization levels tailored for efficient local inference on consumer hardware, especially Mac systems. The release includes the iq3 (13 GB) and iq4 (18 GB) quantizations, putting the model within reach of a wider range of users.

A performance benchmark is also shared: the Qwen 3.6 35B-A3B model reaches 79 tokens per second on an RTX 5070 Ti GPU paired with a 9800X3D CPU, with the `--n-cpu-moe` flag as the critical optimization.

Finally, the article presents a fix for the "ssm_conv1d tensor drift" issue in GGUF quantized models: using the Wasserstein metric to minimize drift, keeping the quantized weights closer in distribution to the original unquantized model.
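The drift-minimization idea can be sketched as a search for the quantization scale whose dequantized values stay closest, in 1D Wasserstein distance, to the original tensor. This is an illustrative reconstruction under stated assumptions, not the article's actual code: the helper names, the 4-bit symmetric scheme, the candidate-scale grid, and the random stand-in tensor are all hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    # For equal-size samples, the 1D Wasserstein-1 distance reduces to the
    # mean absolute difference between the sorted samples.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def quantize(x, scale, bits=4):
    # Symmetric round-to-nearest quantization followed by dequantization.
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def best_scale(x, bits=4, candidates=50):
    # Search scale factors around the naive max-abs scale; keep the one
    # whose dequantized values drift least from the originals.
    base = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    best_s, best_d = base, np.inf
    for f in np.linspace(0.8, 1.2, candidates):
        s = base * f
        d = wasserstein_1d(x, quantize(x, s, bits))
        if d < best_d:
            best_s, best_d = s, d
    return best_s, best_d

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)  # stand-in for an ssm_conv1d weight tensor
scale, drift = best_scale(w)
```

Minimizing the Wasserstein distance penalizes shifts in the overall weight distribution rather than only per-element rounding error, which is the property that keeps small, narrowly distributed tensors such as ssm_conv1d faithful after quantization.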


AI Curator - Daily AI News Curation
