Compressing Vision-Language Models with TurboQuant
The author implemented TurboQuant, a Google technique for compressing transformer KV caches, as a vLLM plugin and tested it on a large vision-language video model, finding significant memory savings with minimal quality loss.
Why it matters
Compressing vision-language models is crucial for deploying them on consumer hardware and enabling real-time video processing.
Key Points
- TurboQuant compresses transformer KV caches by 4-5x with no accuracy loss
- The author tested it on a vision-language model processing video, which has much larger KV caches than text-only models
- The plugin approach allows easy integration without forking or modifying the model code
- The author found and fixed several bugs not present in the original TurboQuant paper
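The 4-5x figure in the first point can be made concrete with a toy round-trip. The sketch below is a generic illustration of KV-cache quantization (per-group 4-bit absmax with fp16 scales), not the actual TurboQuant algorithm; all names and parameters here are hypothetical:

```python
import numpy as np

def quantize_kv(x, group_size=64, bits=4):
    """Toy per-group absmax quantizer (NOT TurboQuant itself)."""
    flat = x.reshape(-1, group_size).astype(np.float32)
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_kv(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

# A fake fp16 KV-cache slice: heads x tokens x head_dim
kv = np.random.randn(8, 128, 64).astype(np.float16)
q, s = quantize_kv(kv)
recon = dequantize_kv(q, s, kv.shape)

orig_bytes = kv.nbytes                       # fp16: 2 bytes per value
comp_bytes = q.size // 2 + s.nbytes          # 4-bit payload + fp16 scales
print(f"compression: {orig_bytes / comp_bytes:.1f}x")
rel_err = np.abs(recon - kv.astype(np.float32)).mean() / np.abs(kv).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Even this naive scheme lands near 4x on fp16 inputs; the reported lossless 4-5x presumably comes from the more sophisticated transforms in the paper.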
Details
Google recently published TurboQuant, a technique that compresses transformer KV caches by 4-5x with no accuracy loss. The author wanted to test whether it would hold up on vision-language models processing video, whose KV caches can be 10x larger than those of text-only models. To do so, they developed a TurboQuant plugin for the vLLM framework that integrates without modifying the model code.

The results show that TurboQuant compression works well on the Molmo2 vision-language model, shrinking a 1.6GB KV cache to 435MB on an RTX 4090 GPU, with output quality nearly identical to the uncompressed baseline. The author also developed an incremental dequantization approach that reduced the decompression overhead from 3.36x to 1.78x.

Along the way, the author found and fixed several bugs not present in the original TurboQuant paper, including issues with FP16 norms, QJL correction, and multi-layer precision drift in fused kernels. These discoveries highlight the importance of validating new compression techniques across a variety of model architectures and hardware.
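As an illustration of what an incremental dequantization scheme can look like (a generic sketch under assumed details, not the author's actual kernel): keys stay quantized in memory and are expanded one small block at a time inside the score computation, so a full-precision copy of the cache is never materialized at once.

```python
import numpy as np

def attn_scores_incremental(query, q_keys, scales, block=32):
    """Dot-product attention scores over an int8 key cache,
    dequantizing one block of tokens at a time (hypothetical sketch)."""
    n = q_keys.shape[0]
    scores = np.empty(n, dtype=np.float32)
    for start in range(0, n, block):
        end = min(start + block, n)
        # Dequantize only this block, use it, then let it be freed.
        k = q_keys[start:end].astype(np.float32) * scales[start:end]
        scores[start:end] = k @ query
    return scores

# Build a toy quantized key cache: 200 tokens, head_dim 64,
# one fp16 absmax scale per token.
rng = np.random.default_rng(0)
keys = rng.standard_normal((200, 64)).astype(np.float32)
scales = (np.abs(keys).max(axis=1, keepdims=True) / 127).astype(np.float16)
q_keys = np.round(keys / scales).astype(np.int8)

query = rng.standard_normal(64).astype(np.float32)
inc = attn_scores_incremental(query, q_keys, scales)
full = (q_keys.astype(np.float32) * scales) @ query  # full-dequant baseline
print(np.allclose(inc, full))
```

The block-wise version computes the same scores as dequantizing the whole cache up front, while peak working memory is bounded by the block size rather than the sequence length; a fused GPU kernel would do this inside the attention loop itself.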