Dev.to | Machine Learning | 3h ago | Research & Papers | Products & Services

Compressing Vision-Language Models with TurboQuant

The author implemented TurboQuant, a technique from Google that compresses transformer KV caches, as a vLLM plugin for vision-language models. Tested on the Molmo2 vision-language model with video input, it delivered significant memory savings with minimal quality loss.

Why it matters

Compressing vision-language models is crucial for deploying them on consumer hardware and enabling real-time video processing.

Key Points

  • TurboQuant reportedly compresses transformer KV caches by 4-5x with no accuracy loss
  • The author tested it on a vision-language model processing video, which has a much larger KV cache than a text-only model
  • The plugin approach allows easy integration without forking or modifying the model code
  • The author found and fixed several implementation issues that the original TurboQuant paper does not cover
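
To make the second point concrete, here is a rough sketch of how an uncompressed KV cache grows with the number of tokens in the context. The layer count, head count, head dimension, and token counts are hypothetical values chosen for illustration; they are not Molmo2's actual configuration, and the 10x token ratio simply mirrors the "10x larger" figure quoted in the Details section.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Uncompressed KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes FP16/BF16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape (assumption, not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Hypothetical context lengths: video frames add many visual tokens.
text_prompt_tokens = 1_000
video_prompt_tokens = 10_000

text_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, text_prompt_tokens)
video_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, video_prompt_tokens)

print(f"text-only cache: {text_bytes / 1e6:.0f} MB")
print(f"video cache:     {video_bytes / 1e6:.0f} MB")
```

Because the cache size is linear in the token count, a video prompt with ten times as many tokens needs ten times the cache memory, which is why compression matters more for video workloads.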

Details

Google recently published TurboQuant, a technique reported to compress transformer KV caches by 4-5x with no accuracy loss. The author wanted to test whether it also works on vision-language models processing video, whose KV caches can be 10x larger than those of text-only models. They built a TurboQuant plugin for the vLLM framework that integrates without modifying the model code.

On the Molmo2 vision-language model, running on an RTX 4090 GPU, TurboQuant reduced the KV cache from 1.6GB to 435MB, and the output quality was nearly identical to the uncompressed baseline. The author also developed an incremental dequantization approach that cut the decompression overhead from 3.36x to 1.78x.

Along the way, the author found and fixed several issues that the original TurboQuant paper does not cover, involving FP16 norms, QJL correction, and multi-layer precision drift in fused kernels. These findings highlight the importance of validating new compression techniques across a variety of model architectures and hardware.
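
As a quick sanity check on the reported numbers, the following sketch derives the effective compression ratio and the overhead reduction from the figures quoted above. It assumes decimal units (1 GB = 1000 MB), which the summary does not specify; the variable names are illustrative.

```python
# Figures as reported in the summary.
baseline_mb = 1.6 * 1000    # 1.6 GB, assuming 1 GB = 1000 MB
compressed_mb = 435.0
overhead_before = 3.36      # decompression overhead before incremental dequantization
overhead_after = 1.78       # decompression overhead after

compression_ratio = baseline_mb / compressed_mb
memory_saved = 1 - compressed_mb / baseline_mb
overhead_cut = 1 - overhead_after / overhead_before

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"memory saved:      {memory_saved:.0%}")
print(f"overhead reduced:  {overhead_cut:.0%}")
```

Under that assumption the measured ratio comes out at roughly 3.7x, a little under the 4-5x headline figure. The summary does not say what accounts for the difference.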
