7B Parameters Does Not Mean 8GB VRAM Is Enough
This article discusses the common misconception that a 7B parameter model can be run on an 8GB VRAM GPU. It explains that parameter count is not the full story, and factors like context length, quantization, batching, and runtime overhead can significantly impact the VRAM requirements.
Why it matters
This article provides important insights for developers and researchers working with large language models, highlighting the need to consider factors beyond just parameter count when estimating VRAM requirements.
Key Points
- Parameter count does not determine the full memory bill
- KV cache grows with context length, increasing VRAM usage
- Quantization reduces weight memory but not all other costs
- Batching and runtime stack choices also affect VRAM requirements
Details
The article explains that a 7B parameter model may feel easy to run in a demo yet become a problem in a real application, because VRAM usage is not determined by parameter count alone. The weights are only the baseline: as context length grows, the KV cache grows with it, and quantization shrinks weight memory without eliminating activation, cache, or runtime costs. Batching multiplies the KV cache across concurrent requests, and the runtime stack (e.g., vLLM, TGI, or a custom stack) adds its own allocation overhead, any of which can push the setup over the edge. The article recommends treating 8GB VRAM as potentially enough for small experiments, but not safe for production inference, and leaving margin for real-world workloads.
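The interaction of weights, quantization, and KV cache described above can be put into a rough back-of-envelope calculation. The sketch below assumes a hypothetical Llama-style 7B configuration (32 layers, 32 attention heads, head dimension 128) and fp16 KV cache; the function names and the specific numbers are illustrative assumptions, not figures from the article.

```python
def weight_bytes(n_params, bytes_per_param):
    """Memory for model weights alone (ignores activations and overhead)."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    """KV cache size: one key and one value vector per layer, per token,
    per sequence in the batch (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GiB = 1024 ** 3

# Hypothetical 7B config: 32 layers, 32 KV heads, head_dim 128
weights_fp16 = weight_bytes(7_000_000_000, 2)        # fp16 weights
weights_q4 = weight_bytes(7_000_000_000, 0.5)        # 4-bit quantized weights
kv_8k = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=1)

print(f"fp16 weights:            {weights_fp16 / GiB:5.1f} GiB")
print(f"4-bit weights:           {weights_q4 / GiB:5.1f} GiB")
print(f"KV cache @ 8k, batch 1:  {kv_8k / GiB:5.1f} GiB")
```

Under these assumptions, fp16 weights alone (~13 GiB) already exceed 8GB, and even with 4-bit weights (~3.3 GiB) an 8k-token fp16 KV cache adds another 4 GiB per concurrent sequence, before activations and runtime overhead. That is the sense in which quantization helps with the weights but not with the rest of the bill.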