Lessons Learned from 12 Failed Compression Approaches for AI Models
The article covers 12 compression techniques the author tested for KV cache compression, none of which achieved the desired results. It explains why each approach fell short and distills the lessons learned.
Why it matters
The insights from these failed experiments can help researchers and engineers working on AI model compression avoid common pitfalls and save time.
Key Points
1. PCA rotation performed worse than Hadamard rotation due to distribution shift
2. Larger group sizes for per-group scaling led to quality degradation
3. Adaptive bitwidth allocation provided negligible gains over flat quantization
4. Per-head token eviction caused catastrophic performance issues
5. Token merging destroyed positional information and led to significant PPL degradation
6. Entropy coding of lattice indices without delta coding resulted in poor compression
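The group-size trade-off in point 2 can be sketched numerically. The snippet below is illustrative only, not the author's setup: it assumes int4 absmax quantization and synthetic heavy-tailed data standing in for KV-cache activations. A larger group means one outlier inflates the scale for more neighbors, so reconstruction error grows with group size:

```python
import numpy as np

def quantize_groups(x, group_size, bits=4):
    """Per-group absmax quantization: each group is scaled by its max |value|."""
    qmax = 2 ** (bits - 1) - 1
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0          # avoid division by zero for all-zero groups
    q = np.round(x / scale).clip(-qmax, qmax)
    return (q * scale).reshape(-1)   # dequantized reconstruction

rng = np.random.default_rng(0)
# Heavy-tailed samples mimic activations with occasional outlier channels.
x = rng.standard_t(df=3, size=4096)

errors = {}
for g in (8, 32, 128):
    errors[g] = float(np.mean((x - quantize_groups(x, g)) ** 2))
    print(f"group size {g:4d}: MSE {errors[g]:.5f}")
```

On this synthetic data the mean squared error rises as the group size grows, which is the quality degradation the key point describes: fewer scales saved, but coarser quantization wherever an outlier lands.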
Details
The author tested various compression techniques for the KV cache, including PCA rotation, larger group sizes for per-group scaling, adaptive bitwidth allocation, per-head token eviction, token merging, and entropy coding of lattice indices. While these approaches sounded promising in theory, they all failed to deliver the expected results in practice. The key lessons learned:

- Data-free rotations outperform data-fitted rotations when distribution shift is unavoidable.
- Larger group sizes for scaling trade off against quantization accuracy.
- Token eviction and quantization solve related problems, so doing both is redundant.
- The KV cache is shared infrastructure, so eviction must operate on the shared sequence.
- Token position is semantic, not just a coordinate, so merging destroys important information.

The author emphasizes that negative results build trust and save time for others working on similar problems.
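The first lesson, data-free rotations beating data-fitted ones, can be illustrated with a small sketch. This is not the author's code; it assumes a Sylvester-constructed orthonormal Hadamard matrix and a synthetic activation vector with one outlier channel. Because the rotation needs no calibration data, it cannot be invalidated by distribution shift, and it spreads the outlier's energy across all coordinates, shrinking the dynamic range a quantizer must cover:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 64
H = hadamard(n)

# Hypothetical activation vector: uniform values plus one large outlier channel.
v = np.ones(n)
v[0] = 50.0

rotated = H @ v
# The orthonormal rotation preserves the norm but flattens the peak,
# so a uniform quantizer wastes far less range on a single channel.
print(f"max |v| before: {np.abs(v).max():.2f}, after rotation: {np.abs(rotated).max():.2f}")
```

A PCA rotation fitted on calibration data would do at least as well on that data, but any shift between the calibration and deployment distributions erodes its advantage, while the Hadamard transform's outlier-spreading property holds for any input.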