Dev.to · Machine Learning · 3h ago | Research & Papers · Products & Services

Implementing Google's TurboQuant on a Vision-Language Model

The author implemented Google's TurboQuant compression technique on a vision-language model processing video and tested it on a consumer GPU, uncovering insights not covered in the original paper.

💡

Why it matters

The author's findings provide practical insights for implementing TurboQuant on real-world vision-language models, beyond the text-only scenarios covered in the original paper.

Key Points

  • 4-bit nibble packing achieves better compression and quality than 3-bit unpacked
  • FP16 norms fail silently at scale due to precision loss, requiring float32
  • Fused Triton kernel with pre-rotated queries provides a 17.8x speedup but causes output degeneration
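The nibble-packing idea in the first point can be sketched as follows. This is an illustrative example, not the author's implementation: two 4-bit quantization codes (values 0-15) are packed into each byte, halving storage relative to one-code-per-byte, which is why 4-bit packed beats 3-bit left unpacked.

```python
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes (0..15) into single bytes: high nibble first."""
    assert codes.ndim == 1 and codes.size % 2 == 0
    hi = codes[0::2].astype(np.uint8)
    lo = codes[1::2].astype(np.uint8)
    return (hi << 4) | lo

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Recover the original 4-bit codes from packed bytes."""
    hi = packed >> 4
    lo = packed & 0x0F
    return np.stack([hi, lo], axis=1).reshape(-1)

codes = np.array([3, 15, 0, 7], dtype=np.uint8)
packed = pack_nibbles(codes)   # 2 bytes instead of 4
assert np.array_equal(unpack_nibbles(packed), codes)
```

A 3-bit code stored one-per-byte still occupies 8 bits, so packed 4-bit storage is both smaller and higher fidelity.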

Details

The author implemented TurboQuant, a technique that compresses transformer KV caches to 3-4 bits per coordinate with zero accuracy loss, on a vision-language model (Molmo2-4B) processing Seinfeld video clips. They found that 4-bit nibble packing provides better compression and quality than the 3-bit approach described in the paper. However, storing vector norms in FP16 led to output degeneration on longer sequences, because precision loss accumulates across transformer layers; float32 norms fixed this. The author also built a fused Triton kernel to compute Q @ compressed_K^T directly, achieving a 17.8x speedup, but this degraded outputs because the model expects bf16 attention numerics that the kernel did not reproduce. The solution that worked was incremental dequantization: decompressing only the new token at each layer instead of the entire cache.
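The incremental-dequantization idea described above can be sketched roughly as below. This is a minimal illustration, not the author's code: the class name, the symmetric 4-bit scheme (codes 0-15, zero point at 8), and the per-token scale are all assumptions for the sketch. The point is that each decode step dequantizes only the newest token's key into a running float32 buffer, so the full cache is never re-decompressed.

```python
import numpy as np

class IncrementalKCache:
    """Sketch of a KV-cache key buffer with incremental dequantization.

    Assumes a hypothetical 4-bit symmetric quantization: code in 0..15,
    zero point 8, one float scale per token.
    """

    def __init__(self, head_dim: int):
        # running float32 buffer of already-dequantized keys
        self.dequant = np.empty((0, head_dim), dtype=np.float32)

    def append(self, codes: np.ndarray, scale: float) -> None:
        # dequantize ONLY the newly generated token's key vector
        k = (codes.astype(np.float32) - 8.0) * scale
        self.dequant = np.vstack([self.dequant, k[None, :]])

    def scores(self, q: np.ndarray) -> np.ndarray:
        # attention logits against the cached, already-dequantized keys
        return self.dequant @ q

cache = IncrementalKCache(head_dim=4)
cache.append(np.array([8, 9, 7, 8], dtype=np.uint8), scale=0.5)
cache.append(np.array([0, 15, 8, 8], dtype=np.uint8), scale=0.5)
logits = cache.scores(np.ones(4, dtype=np.float32))
```

Per step this does O(head_dim) dequantization work instead of O(sequence_length × head_dim), while the attention matmul itself runs in full precision, avoiding the bf16-mismatch problem of the fused kernel.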

