Python Library Reduces LLM Memory Usage by 80%
The article introduces a Python library called tqai that can significantly reduce the memory usage of large language models (LLMs) by compressing the key-value cache. The library implements a technique called TurboQuant, which involves rotating and quantizing the cache vectors.
Why it matters
Shrinking the key-value cache lowers the hardware bar for running large language models locally, making these powerful AI systems practical on a much wider range of machines.
Key Points
- LLMs can consume large amounts of memory due to the key-value cache, which grows linearly with context length
- TurboQuant is a technique that rotates the cache vectors, quantizes them independently, and stores the norms separately
- The tqai library provides a simple API to apply this compression to Transformer-based models, reducing memory usage by up to 80%
- The library supports both PyTorch and MLX backends, and can be used with models like Llama, Qwen, and Mistral
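The linear growth of the cache is easy to quantify: attention stores one key and one value vector per layer, per head, per token. A back-of-the-envelope sketch (model sizes here are illustrative, roughly Llama-7B-shaped, and are not taken from the article):

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache: 2x for keys and values, linear in seq_len."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# 32 layers, 32 heads of dim 128, 8192-token context, FP16 elements
full = kv_cache_bytes(32, 32, 128, 8192)
compressed = int(full * 0.2)  # the ~80% reduction the article claims

print(f"FP16 KV cache: {full / 2**30:.1f} GiB")        # 4.0 GiB
print(f"Compressed:    {compressed / 2**30:.1f} GiB")  # 0.8 GiB
```

Doubling the context doubles the cache, which is why long-context local inference hits memory limits even when the weights themselves fit.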
Details
The article discusses the problem of high memory usage when running large language models (LLMs) locally, particularly due to the key-value cache, where attention stores every key and value vector for every token. This cache can take up 30-50% of total memory even when the model weights themselves fit, and the standard approach of quantizing the weights does not address it.

The article then introduces the TurboQuant technique, published by Google Research, which compresses the key-value cache in a data-oblivious way. The key steps are:

1. Rotate the cache vectors by a random orthogonal matrix to spread information uniformly across coordinates.
2. Quantize each coordinate independently using precomputed optimal codebooks.
3. Store the norms separately in FP16.

This approach requires no training or calibration data, and the same codebooks can be used across different models. Finally, the article presents the tqai Python library, which implements TurboQuant compression and provides a simple API to apply it to Transformer-based models, reducing memory usage by up to 80%.
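The three steps above can be sketched in plain NumPy. This is an illustration of the rotate-quantize-store-norm idea only, not tqai's actual API: the function names are made up here, the random QR-based rotation and the uniform 4-bit codebook are stand-ins (TurboQuant uses precomputed optimal codebooks, as noted above).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def compress(v, rot, codebook):
    """Rotate one cache vector, split off its norm, quantize per coordinate."""
    r = rot @ v
    norm = np.float16(np.linalg.norm(r))       # step 3: norm stored in FP16
    unit = r / max(float(norm), 1e-12)
    # step 2: each coordinate independently snapped to the nearest codeword
    idx = np.abs(unit[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), norm

def decompress(idx, norm, rot, codebook):
    # Undo quantization, rescale by the stored norm, rotate back.
    return rot.T @ (codebook[idx] * float(norm))

d = 64
rot = random_rotation(d, rng)                  # step 1: random rotation
codebook = np.linspace(-0.5, 0.5, 16)          # illustrative 4-bit codebook
v = rng.standard_normal(d)

idx, norm = compress(v, rot, codebook)
v_hat = decompress(idx, norm, rot, codebook)
```

Because the rotation spreads each vector's energy evenly across coordinates, one fixed codebook works for every coordinate of every vector, which is what makes the scheme data-oblivious: no calibration pass over model activations is needed.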