How TurboQuant Reduces RAM Usage for Large Language Models
This article explains how large language models (LLMs) store words as high-dimensional vectors, which leads to massive memory usage during inference. It introduces TurboQuant, a technique that compresses these vectors using a scale and code approach to significantly reduce RAM requirements.
Why it matters
Reducing the memory usage of large language models is crucial for scaling these models to real-world applications and making them more accessible.
Key Points
- LLMs store words as high-dimensional vectors, not as text
- These vectors are transformed and expanded as they move through the model, producing hundreds of thousands of numbers per token
- The KV cache that holds these numbers can consume gigabytes of RAM, even for a single conversation
- TurboQuant compresses these vectors with a scale-and-code approach, cutting RAM usage without sacrificing accuracy
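The memory figures above follow from simple arithmetic. Here is a back-of-envelope sketch; the layer count and hidden size below are illustrative assumptions, not TurboQuant's or any specific model's actual configuration:

```python
# Hypothetical model dimensions -- chosen only to make the arithmetic
# land near the article's "500 million numbers, ~1 GB" figure.
layers = 100
kv_tensors = 2          # one key and one value vector per layer
hidden_size = 1280      # numbers per key or value vector
bytes_per_number = 2    # 16-bit floats

numbers_per_token = layers * kv_tensors * hidden_size
tokens = 2000           # a medium-length conversation

total_numbers = numbers_per_token * tokens
total_bytes = total_numbers * bytes_per_number

print(f"{numbers_per_token:,} numbers per token")
print(f"{total_numbers:,} numbers total, about {total_bytes / 1e9:.1f} GB")
```

With these assumed dimensions, each token contributes 256,000 cached numbers, and a 2,000-token conversation reaches roughly half a billion numbers, about 1 GB at 16-bit precision.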
Details
Large language models (LLMs) such as GPT-3 do not store words as text, but as high-dimensional vectors of numbers. These vectors encode semantic relationships between words, with similar words located close together in the vector space. As a token moves through the model, it is transformed at every layer, and the key and value vectors produced along the way are kept in the KV cache, amounting to hundreds of thousands of numbers per token. For a conversation with 2,000 tokens, this can add up to over 500 million numbers, consuming around 1 GB of RAM at 16-bit precision.

Simply reducing the precision of these numbers (e.g., using 8-bit instead of 16-bit) is not enough on its own, because naive rounding distorts the subtle numerical relationships the model relies on for accurate attention calculations. TurboQuant addresses this by storing the numbers in a structured way, as a shared scale plus a small integer code per value, allowing significant compression without sacrificing model accuracy. The smaller KV cache footprint lets LLMs handle longer conversations and serve more users simultaneously.
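The scale-and-code idea can be illustrated with a minimal sketch. The code below uses per-vector absmax scaling to signed 8-bit integer codes; this is a generic quantization scheme for illustration only, not TurboQuant's actual (more sophisticated) algorithm:

```python
def quantize(vector, bits=8):
    """Store one floating-point scale plus one small integer code per value."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit signed codes
    scale = max(abs(x) for x in vector) / qmax or 1.0
    codes = [round(x / scale) for x in vector]       # each code fits in one byte
    return scale, codes

def dequantize(scale, codes):
    """Reconstruct approximate floats from the scale and the codes."""
    return [scale * c for c in codes]

# Toy example: four values from a hypothetical key vector.
v = [0.12, -1.9, 0.03, 0.75]
scale, codes = quantize(v)
approx = dequantize(scale, codes)
# Reconstruction error is bounded by scale / 2 per value.
```

At 16-bit precision, a 128-number vector costs 256 bytes; stored as 8-bit codes plus one shared 16-bit scale it costs 130 bytes, and 4-bit codes would cut it further, which is where the memory savings come from.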