The E8 Lattice: The Perfect Quantizer for KV Caches
This article explores how the E8 lattice, a mathematical structure with optimal sphere-packing density in 8 dimensions, serves as a near-ideal quantizer for KV cache vectors in large language models.
Why it matters
This novel quantization technique based on the E8 lattice structure can significantly improve the efficiency of large language models without compromising accuracy.
Key Points
1. The E8 lattice has the highest possible kissing number in 8 dimensions (240 nearest neighbors).
2. Hadamard-transformed KV cache vectors follow a sub-Gaussian distribution, which aligns well with E8's structure.
3. Relaxing the strict even-sum parity constraint on E8 codewords improves quantization error by 0.3-0.4%.
4. The E8-based quantization pipeline outperforms INT8 uniform quantization and Product Quantization on KV cache data.
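Point 2 refers to a standard incoherence trick: rotating each block of the cache with an orthonormal Hadamard matrix spreads outlier energy across coordinates, so each coordinate looks closer to Gaussian. A minimal NumPy sketch of that rotation for 8-dimensional blocks (the article's exact transform and scaling are assumptions):

```python
import numpy as np

def hadamard_8():
    """Build the orthonormal 8x8 Hadamard matrix by the Sylvester
    construction: H_{2n} = [[H_n, H_n], [H_n, -H_n]]."""
    H = np.array([[1.0]])
    for _ in range(3):  # 1x1 -> 2x2 -> 4x4 -> 8x8
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(8.0)  # orthonormal, so vector norms are preserved

def rotate(v):
    """Rotate one length-8 block before quantization; a single outlier
    coordinate gets smeared evenly over all eight coordinates."""
    return hadamard_8() @ v
```

Because the matrix is orthonormal, the rotation is exactly invertible (apply the transpose), so it costs nothing in fidelity; only the quantizer sees the better-behaved distribution.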
Details
The E8 lattice is a special mathematical structure with optimal packing density in 8 dimensions. This makes it an ideal choice for quantizing KV cache vectors in large language models, because these vectors tend to follow a sub-Gaussian distribution after a Hadamard transform is applied. The shell structure of E8 aligns well with the probability mass of a spherically symmetric Gaussian, placing more codewords where the data is concentrated.

Interestingly, the authors found that relaxing the strict even-sum parity constraint on E8 codewords further improves quantization performance, as it restores codepoints near the origin where sub-Gaussian data is most likely to fall.

This E8-based quantization pipeline outperforms traditional methods like INT8 uniform quantization and Product Quantization, achieving a 22% reduction in mean squared error and even improving model perplexity compared to fp16 on the Mistral-7B KV cache.
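The nearest-point search in E8 has a well-known closed form (Conway and Sloane): E8 is the union of D8 (integer vectors with even coordinate sum) and D8 shifted by (1/2, ..., 1/2), so one decodes both cosets and keeps the closer point. The sketch below also shows one plausible reading of the "relaxed parity" variant, in which the even-sum constraint is simply dropped; the article's exact relaxation may differ.

```python
import numpy as np

def _closest_d8(x):
    """Nearest point of D8 (integer vectors with even coordinate sum)."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Restore even parity by re-rounding the coordinate with the
        # largest rounding error in the opposite direction.
        i = int(np.argmax(np.abs(x - f)))
        f[i] += 1.0 if x[i] > f[i] else -1.0
    return f

def quantize_e8(x):
    """Nearest point of E8 = D8 ∪ (D8 + (1/2, ..., 1/2))."""
    a = _closest_d8(x)
    b = _closest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def quantize_e8_relaxed(x):
    """Same two cosets with the even-sum constraint dropped: round to
    the nearer of the integer grid and the half-integer grid. This
    restores points such as (1, 0, ..., 0) near the origin, which
    strict E8 excludes (its nonzero points have norm >= sqrt(2))."""
    a = np.round(x)
    b = np.round(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b
```

For an input like (0.9, 0, ..., 0), strict E8 must snap to a point with even coordinate sum, while the relaxed quantizer can use (1, 0, ..., 0), illustrating why dropping parity helps data concentrated near the origin.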