Implementing Google's TurboQuant on a Vision-Language Model
The author implemented Google's TurboQuant compression technique on a vision-language model processing video and tested it on a consumer GPU, uncovering insights not covered in the original paper.
Why it matters
The findings offer practical guidance for applying TurboQuant to real-world vision-language models, beyond the text-only setting evaluated in the original paper.
Key Points
- 4-bit nibble packing achieves better compression and quality than 3-bit unpacked
- FP16 norms fail silently at scale due to precision loss, requiring float32
- Fused Triton kernel with pre-rotated queries provides a 17.8x speedup but causes output degeneration
Details
The author implemented TurboQuant, a technique that compresses transformer KV caches to 3-4 bits per coordinate with zero accuracy loss, on a vision-language model (Molmo2-4B) processing Seinfeld video clips. They found that 4-bit nibble packing provides better compression and quality than the 3-bit approach described in the paper. However, storing vector norms in FP16 led to output degeneration on longer sequences, as precision loss accumulated across transformer layers; switching the norms to float32 fixed this. The author also built a fused Triton kernel to compute Q @ compressed_K^T directly, achieving a 17.8x speedup, but this degraded the output because the model expects bf16 attention numerics. The solution that worked was incremental dequantization: decompressing only the new token at each layer instead of the entire cache.
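To see why 4-bit packing beats a 3-bit scheme in practice: two 4-bit codes fit exactly into one byte (0.5 bytes per coordinate), while 3-bit codes don't align to byte boundaries and are typically stored one per byte unless you do awkward bit-level packing. A minimal sketch of nibble packing (function names are ours, not from the post):

```python
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes (values 0-15) two per byte: even index in the low nibble."""
    assert codes.ndim == 1 and codes.size % 2 == 0
    lo = codes[0::2] & 0x0F
    hi = codes[1::2] & 0x0F
    return ((hi << 4) | lo).astype(np.uint8)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles: recover the original 4-bit code sequence."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out
```

The round trip is exact, so packing costs nothing in quality; the quality gap versus 3-bit comes entirely from the extra quantization level per coordinate.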
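The FP16 norm failure is easy to reproduce in isolation. The numbers below are illustrative, not taken from the post: fp16 has only a 10-bit mantissa, so its spacing between 1024 and 2048 is already 1.0, and it overflows entirely at 65504.

```python
import numpy as np

# Rounding: near 2000, fp16 can only represent whole numbers,
# so a stored norm silently loses its fractional part.
norm_fp32 = np.float32(1917.3)
norm_fp16 = np.float16(norm_fp32)
print(float(norm_fp16))  # 1917.0

# Overflow: fp16 tops out at 65504; a large-magnitude key vector
# can produce a norm that silently becomes inf.
big_norm = np.linalg.norm(np.full(4096, 1200.0, dtype=np.float32))
print(np.float16(big_norm))  # inf
```

Each per-vector error is tiny, but attention re-scales by these norms at every layer, which is consistent with the post's observation that the loss compounds into visible degeneration only on longer sequences. Storing the norms in float32 costs two extra bytes per vector and avoids both failure modes.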
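One plausible reading of the incremental-dequantization fix, sketched with a hypothetical uniform 4-bit quantizer (all names and details here are ours, not the author's code): the cache stays compressed, but a dequantized copy is extended by exactly one row per decode step, so old tokens are never re-decompressed and attention runs in the float dtype the model expects.

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Hypothetical per-vector uniform quantizer: codes in [0, 15] plus one scale."""
    scale = np.float32(np.abs(x).max() / 7.5 + 1e-8)
    codes = np.clip(np.round(x / scale + 7.5), 0, 15).astype(np.uint8)
    return codes, scale

def dequantize_4bit(codes: np.ndarray, scale: np.float32) -> np.ndarray:
    return (codes.astype(np.float32) - 7.5) * scale

class IncrementalKVCache:
    """Append-only key cache: quantize each new key once, and dequantize
    only that new token rather than the whole cache at every step."""
    def __init__(self):
        self.codes, self.scales = [], []
        self.dequant_rows = []

    def append(self, k_new: np.ndarray):
        codes, scale = quantize_4bit(k_new)
        self.codes.append(codes)
        self.scales.append(scale)
        # Incremental step: decompress just the newly written token.
        self.dequant_rows.append(dequantize_4bit(codes, scale))

    def keys(self) -> np.ndarray:
        return np.stack(self.dequant_rows)  # (seq_len, head_dim)
```

Per step this does O(head_dim) dequantization work instead of O(seq_len x head_dim), which is why it avoids the cost that motivated the fused kernel in the first place.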