Benchmarking 3 Qwen3.5 Models on an RTX 4060 8GB
The author tests three Qwen3.5 models (9B, 27B, and 35B-A3B) on an RTX 4060 with 8GB of VRAM to understand the gap between spec-sheet numbers and real-world usability. The results reveal surprising differences in speed, GPU utilization, and usable context length despite similar VRAM footprints.
Why it matters
Parameter counts and spec sheets say little about whether a model is actually usable on consumer hardware; this article measures speed, VRAM, and context behavior head to head to show where the real differences lie.
Key Points
- VRAM usage alone does not determine model speed; GPU utilization is key
- The 35B-A3B MoE model outperforms the 9B on GPU utilization thanks to its sparse activation
- Usable context length varies by task, causing context exhaustion on more complex prompts
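The first point can be made concrete with back-of-envelope arithmetic: at different quantization levels, models with very different parameter counts can occupy roughly the same VRAM, which is why footprint alone predicts nothing about speed. A minimal sketch (the bit-widths below are illustrative assumptions, not figures from the article):

```python
def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate VRAM for the weights alone: params * bits / 8, in GB.
    Ignores KV cache and activation buffers for simplicity."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# A 9B model at 6-bit quantization and a 27B model at 2-bit land on the
# same footprint, yet run at very different speeds:
print(weight_vram_gb(9, 6))   # → 6.75
print(weight_vram_gb(27, 2))  # → 6.75
```

The KV cache grows with context length on top of this, which is one reason usable context varies so much between models that "fit" on paper.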
Details
The author runs three qualitatively different tasks (code generation, knowledge synthesis, and reasoning) on the three Qwen3.5 models to expose their distinct characteristics. Despite similar VRAM usage (~7.5GB), the models show a 10x spread in speed, with the 9B fastest at 33 tokens/sec. The gap arises because the 27B and 35B-A3B only partially fit on the GPU, leaving some layers offloaded to the CPU. Even so, the 35B-A3B MoE model achieves higher GPU utilization (95%) than the 9B (91%) by activating only a small subset of its 35B parameters per token, though this comes at the cost of higher system RAM usage.
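The two effects above reduce to simple ratios: how many parameters actually fire per token, and how many layers actually live in VRAM. A hedged sketch (the layer counts and the 3B-active reading of the "A3B" name are illustrative assumptions, not measurements from the article):

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters that fire per token in a mixture-of-experts
    model; per-token compute scales with the active set, not the total."""
    return active_params_b / total_params_b

def gpu_layer_fraction(gpu_layers: int, total_layers: int) -> float:
    """Fraction of transformer layers resident in VRAM; layers left on
    the CPU dominate per-token latency once this drops below 1.0."""
    return gpu_layers / total_layers

# 35B-A3B read as ~3B of 35B parameters active per token:
print(moe_active_fraction(35, 3))   # ≈ 0.086

# Illustrative split for a dense model that only partially fits in 8GB:
print(gpu_layer_fraction(24, 48))   # → 0.5
```

This is why a sparse 35B model can keep the GPU busier than a dense 9B: the active slice fits entirely in VRAM, while a dense 27B must shuttle work through the CPU.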