Benchmarking 3 Qwen3.5 Models on an RTX 4060 8GB
The author tests three Qwen3.5 models (9B, 27B, and 35B-A3B) on an RTX 4060 with 8GB of VRAM to understand the gap between spec-sheet numbers and real-world usability. The results reveal surprising differences in speed, GPU utilization, and usable context length despite similar VRAM footprints.
Why it matters
Parameter counts and spec sheets say little about whether a model is actually usable on consumer hardware; this article measures speed, VRAM, and context behavior head to head to show where the real differences lie.
Key Points
- VRAM usage alone does not determine model speed; GPU utilization is key
- The 35B-A3B MoE model outperforms the 9B on GPU utilization thanks to its sparse activation
- Usable context length varies by task, causing context exhaustion on more complex prompts
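The first point can be made concrete with back-of-envelope arithmetic: at different quantization levels, models with very different parameter counts can occupy roughly the same VRAM, which is why footprint alone predicts nothing about speed. A minimal sketch (the bit-widths below are illustrative assumptions, not figures from the article):

```python
def weight_vram_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate VRAM for the weights alone: params * bits / 8, in GB.
    Ignores KV cache and activation buffers for simplicity."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# A 9B model at 6-bit quantization and a 27B model at 2-bit land on the
# same footprint, yet run at very different speeds:
print(weight_vram_gb(9, 6))   # → 6.75
print(weight_vram_gb(27, 2))  # → 6.75
```

The KV cache grows with context length on top of this, which is one reason usable context varies so much between models that "fit" on paper.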
Details
The author runs three qualitatively different tasks (code generation, knowledge synthesis, and reasoning) on the three Qwen3.5 models to expose their distinct characteristics. Despite similar VRAM usage (~7.5GB), the models show a 10x spread in speed, with the 9B fastest at 33 tokens/sec. The gap arises because the 27B and 35B-A3B only partially fit on the GPU, leaving some layers offloaded to the CPU. Even so, the 35B-A3B MoE model achieves higher GPU utilization (95%) than the 9B (91%) by activating only a small subset of its 35B parameters per token, though this comes at the cost of higher system RAM usage.
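The two effects above reduce to simple ratios: how many parameters actually fire per token, and how many layers actually live in VRAM. A hedged sketch (the layer counts and the 3B-active reading of the "A3B" name are illustrative assumptions, not measurements from the article):

```python
def moe_active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of parameters that fire per token in a mixture-of-experts
    model; per-token compute scales with the active set, not the total."""
    return active_params_b / total_params_b

def gpu_layer_fraction(gpu_layers: int, total_layers: int) -> float:
    """Fraction of transformer layers resident in VRAM; layers left on
    the CPU dominate per-token latency once this drops below 1.0."""
    return gpu_layers / total_layers

# 35B-A3B read as ~3B of 35B parameters active per token:
print(moe_active_fraction(35, 3))   # ≈ 0.086

# Illustrative split for a dense model that only partially fits in 8GB:
print(gpu_layer_fraction(24, 48))   # → 0.5
```

This is why a sparse 35B model can keep the GPU busier than a dense 9B: the active slice fits entirely in VRAM, while a dense 27B must shuttle work through the CPU.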