Benchmarking 3 Qwen3.5 Models on an RTX 4060 8GB

The author tests three Qwen3.5 models (9B, 27B, 35B-A3B) on an RTX 4060 with 8GB of VRAM to understand the gap between spec-sheet numbers and real-world usability. The results reveal surprising differences in speed, GPU utilization, and usable context length despite the models' similar VRAM footprints.

💡 Why it matters

Parameter counts alone say little about how a model behaves on consumer hardware; this benchmark shows how GPU utilization and offloading, not headline model size, determine real-world performance.

Key Points

  • VRAM usage alone does not determine model speed; GPU utilization is the key factor
  • The 35B-A3B MoE model outperforms the 9B on GPU utilization due to its sparse activation
  • Usable context length varies by task, and more complex prompts can exhaust it
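The MoE point can be made concrete with a back-of-the-envelope calculation. The "A3B" suffix in the model name indicates roughly 3B parameters active per token out of 35B total; the framing as a per-token compute ratio below is an illustrative assumption, not a figure from the article:

```python
# Per-token compute in a transformer scales roughly with the number of
# *active* parameters, not the total weights held in memory.
# "35B-A3B" denotes ~35B total parameters with only ~3B activated per token.
def active_ratio(total_params_b: float, active_params_b: float) -> float:
    """Fraction of the model's weights exercised on each forward pass."""
    return active_params_b / total_params_b

# The MoE model does a small fraction of a dense 35B model's per-token work,
# which is why it can keep GPU utilization high despite its total size:
print(round(active_ratio(35, 3), 3))
```

All 35B parameters still have to live somewhere, which is why the article notes the MoE's higher system RAM usage even as its per-token compute stays small.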

Details

The author runs three qualitatively different tasks (code generation, knowledge synthesis, reasoning) on the three models to expose their distinct characteristics. Despite similar VRAM usage (~7.5GB), the models show a 10x speed difference, with the 9B fastest at 33 tokens/sec. The gap arises because the 27B and 35B-A3B are too large to fit entirely on the GPU, so some layers are offloaded to the CPU, dragging down GPU utilization. Interestingly, the 35B-A3B MoE model achieves higher GPU utilization (95%) than the 9B (91%) by activating only a small subset of its 35B parameters per token, though this comes at the cost of higher system RAM usage.
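The partial-offload effect described above can be sketched as a simple layer-budget calculation. The per-layer sizes, layer counts, and overhead figure below are illustrative assumptions, not measurements from the article:

```python
def plan_offload(n_layers: int, layer_gb: float, vram_gb: float,
                 overhead_gb: float = 1.0) -> tuple[int, int]:
    """Split a model's layers between GPU and CPU given a VRAM budget.

    Reserves `overhead_gb` for the KV cache and runtime buffers; whatever
    layers do not fit run on the CPU (the partial offload that tanks
    throughput for the larger models). All numbers are illustrative.
    """
    usable = vram_gb - overhead_gb
    gpu_layers = min(n_layers, int(usable // layer_gb))
    return gpu_layers, n_layers - gpu_layers

# A 9B-class model at ~0.11 GB/layer over 48 layers fits fully on 8GB:
print(plan_offload(48, 0.11, 8.0))
# A 27B-class model at ~0.25 GB/layer over 62 layers spills to the CPU:
print(plan_offload(62, 0.25, 8.0))
```

Once even a fraction of layers runs on the CPU, every token pays for slow host-side compute and PCIe transfers, which is consistent with the 10x throughput gap the article reports between fully resident and partially offloaded models.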


AI Curator - Daily AI News Curation
