Dev.to · Machine Learning · 9h ago | Research & Papers · Products & Services

Python Library Reduces LLM Memory Usage by 80%

The article introduces a Python library called tqai that can significantly reduce the memory usage of large language models (LLMs) by compressing the key-value cache. The library implements a technique called TurboQuant, which involves rotating and quantizing the cache vectors.

💡

Why it matters

Compressing the key-value cache makes it far more feasible to run large language models locally, lowering hardware requirements and broadening access to these powerful AI systems.

Key Points

  1. LLMs can consume large amounts of memory because of the key-value cache, which grows linearly with context length
  2. TurboQuant rotates the cache vectors, quantizes each coordinate independently, and stores the norms separately
  3. The tqai library provides a simple API to apply this compression to Transformer-based models, reducing memory usage by up to 80%
  4. The library supports both PyTorch and MLX backends and works with models such as Llama, Qwen, and Mistral
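To see why the cache dominates memory, a back-of-envelope estimate helps. The sketch below assumes a hypothetical Llama-3-8B-like shape (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 values); these shape numbers are illustrative assumptions, not figures from the article:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    # 2x for keys and values; one vector per token, per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# assumed Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
full = kv_cache_bytes(32, 8, 128, seq_len=32_768)
print(f"FP16 KV cache @ 32k tokens: {full / 2**30:.1f} GiB")   # 4.0 GiB
print(f"after an 80% reduction:     {full * 0.2 / 2**30:.1f} GiB")  # 0.8 GiB
```

At 32k tokens this hypothetical cache alone is 4 GiB, on top of the model weights, which is why weight-only quantization leaves so much memory on the table.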

Details

The article starts from the problem of high memory usage when running large language models (LLMs) locally, driven by the key-value (KV) cache, where attention stores a key and value vector for every token. This cache can take up 30-50% of total memory even when the model weights themselves fit, and the standard remedy of quantizing the weights does nothing to shrink it.

TurboQuant, a technique published by Google Research, compresses the KV cache in a data-oblivious way. The key steps are: 1) rotate the cache vectors by a random orthogonal matrix to spread information uniformly across coordinates, 2) quantize each coordinate independently using precomputed optimal codebooks, and 3) store the norms separately in FP16. Because the approach requires no training or calibration data, the same codebooks can be reused across different models.

The tqai Python library implements this TurboQuant compression and provides a simple API to apply it to Transformer-based models, reducing memory usage by up to 80%.
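A minimal NumPy sketch of the rotate-quantize-store-norms pipeline described above. It uses a toy uniform 3-bit codebook in place of TurboQuant's precomputed optimal codebooks; all names and specifics here are illustrative and do not reflect tqai's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal rotation
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniformly distributed rotation

def compress(v, Q, codebook):
    rotated = Q @ v                      # 1) rotate: spreads information evenly
    norm = np.float16(np.linalg.norm(rotated))   # 3) keep the norm in FP16
    unit = rotated / np.linalg.norm(rotated)
    # 2) quantize each coordinate independently to the nearest codebook level
    codes = np.argmin(np.abs(unit[:, None] - codebook[None, :]), axis=1)
    return codes.astype(np.uint8), norm

def decompress(codes, norm, Q, codebook):
    # invert: look up levels, rescale by the stored norm, undo the rotation
    return Q.T @ (np.float32(norm) * codebook[codes])

d = 128
Q = random_orthogonal(d, rng)
# toy 3-bit (8-level) codebook scaled for unit-vector coordinates (~1/sqrt(d));
# TurboQuant instead uses precomputed optimal codebooks
codebook = np.linspace(-3.0, 3.0, 8) / np.sqrt(d)

v = rng.standard_normal(d).astype(np.float32)
codes, norm = compress(v, Q, codebook)
v_hat = decompress(codes, norm, Q, codebook)
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
print(f"relative reconstruction error: {rel_err:.3f}")
```

At 3 bits per coordinate plus one FP16 norm per vector, the footprint is roughly (3·d + 16)/(16·d) ≈ 20% of FP16, which lines up with the article's up-to-80% reduction figure.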
