Compress your LLM's KV cache 33x without training
This article introduces NexusQuant, a tool that compresses the key-value (KV) cache of large language models by up to 33x without requiring any training or fine-tuning.
Why it matters
NexusQuant's training-free compression sharply reduces the memory footprint of serving large language models in production, allowing longer contexts and more concurrent requests on the same hardware.
Key Points
- NexusQuant eliminates the memory bottleneck of the KV cache in LLMs
- It combines importance scoring, token eviction, and quantization to achieve high compression ratios
- Supports popular LLMs like Llama, Mistral, and Qwen with minimal performance degradation
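To see why the KV cache is the bottleneck, a back-of-the-envelope calculation helps. The model dimensions below are illustrative (a hypothetical 70B-class model with grouped-query attention), not figures from the article:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of an fp16 KV cache: one key and one value vector
    per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dimensions; real models vary.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # tens of GiB for a single 128K-token sequence
```

Dense multi-head attention (more KV heads) or batched serving multiplies this further, which is how a single deployment reaches the 80GB range.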
Details
The KV cache, which stores each token's attention key and value vectors, can easily consume 80GB of memory for a modern LLM at 128K context. NexusQuant addresses this with a multi-step compression pipeline:

1. Score token importance from attention aggregated across heads.
2. Evict less important tokens while preserving a recent sliding window.
3. Remove the rotary position embeddings (RoPE) to align the key embeddings across positions.
4. Apply a Hadamard rotation to distribute energy uniformly across dimensions.
5. Quantize the values using the dense E8 lattice.
6. Delta-code the surviving token indices and compress them with Zstd.

Together these steps achieve 10x to 33x compression with only 0.4% to 2.6% perplexity degradation, allowing 4.2M tokens to fit in 80GB on an A100 GPU.
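Steps 1 and 2 can be sketched as follows. This is an illustrative reconstruction, not NexusQuant's actual code; the function name, the sum-of-attention scoring rule, and the parameters are assumptions:

```python
import numpy as np

def evict_tokens(attn, keep_ratio=0.25, window=4):
    """Score tokens by the attention mass they receive across all heads and
    queries, then keep the top scorers plus a recent sliding window.
    `attn` has shape (heads, seq, seq)."""
    seq_len = attn.shape[-1]
    # Importance: total attention each token receives, summed over heads and queries.
    scores = attn.sum(axis=(0, 1))
    # The most recent `window` tokens are always preserved.
    scores[-window:] = np.inf
    n_keep = max(window, int(seq_len * keep_ratio))
    kept = np.sort(np.argsort(scores)[-n_keep:])
    return kept  # ascending indices of tokens whose K/V entries survive

rng = np.random.default_rng(0)
attn = rng.random((8, 16, 16))          # toy attention tensor: 8 heads, 16 tokens
kept = evict_tokens(attn, keep_ratio=0.5, window=4)
print(kept)
```

Real importance scorers weigh heads differently (the article mentions cross-head scoring), but the keep-top-k-plus-window shape of the computation is the same.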
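Steps 4 and 5 pair an orthogonal rotation with lattice quantization. The sketch below uses the standard Sylvester Hadamard construction and the classical nearest-point search for E8 (decompose E8 as D8 united with D8 + 1/2); function names and the `scale` parameter are illustrative, not NexusQuant's API:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def nearest_d8(x):
    """Nearest point in D8: integer vectors whose coordinates sum to an even number."""
    f = np.rint(x)
    if f.sum() % 2 != 0:
        i = np.argmax(np.abs(x - f))       # cheapest coordinate to re-round
        f[i] += np.sign(x[i] - f[i]) or 1.0
    return f

def nearest_e8(x):
    """Nearest point in E8 = D8 union (D8 + 1/2): try both cosets, keep the closer."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

def quantize_block(v, scale=2.0):
    """Rotate an 8-dim block to flatten outliers, snap to the E8 lattice,
    then invert both transforms to get the dequantized values."""
    H = hadamard(8)
    q = nearest_e8(H @ v * scale)
    return H.T @ (q / scale)               # H is orthogonal, so H.T undoes it

v = np.array([0.1, -0.3, 0.2, 0.9, -1.1, 0.4, 0.0, 0.6])
print(np.round(quantize_block(v), 3))      # close to v; error shrinks as scale grows
```

The rotation matters because lattice quantizers perform best on roughly Gaussian inputs; a Hadamard rotation mixes every coordinate into every output, suppressing the per-channel outliers common in key embeddings.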
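Step 6 exploits the fact that the kept-token indices are sorted, so their gaps are small and repetitive. A minimal sketch, with the caveat that the article names Zstd while stdlib `zlib` stands in here, and the single-byte delta encoding is a simplification (a real coder would use a varint for gaps of 256 or more):

```python
import zlib

def pack_indices(indices):
    """Delta-code sorted indices, then compress the (highly repetitive) deltas."""
    deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]
    raw = bytes(d % 256 for d in deltas)   # simplification: assumes every gap < 256
    return zlib.compress(raw, level=9)

def unpack_indices(blob):
    """Invert: decompress, then prefix-sum the deltas back into indices."""
    out, pos = [], 0
    for d in zlib.decompress(blob):
        pos += d
        out.append(pos)
    return out

kept = list(range(0, 4000, 7))             # toy index stream with a constant gap
blob = pack_indices(kept)
print(len(blob), "compressed bytes for", len(kept), "indices")
```

A constant-gap stream like this collapses to a handful of bytes; real eviction patterns compress less dramatically but still far below raw 4-byte positions.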