Light Reduces KV Cache Memory Traffic by 16x for LLM Inference
A new paper proposes PRISM, a technique that offloads key-value cache block selection to photonic circuits, reducing memory traffic by 16x for long-context LLM inference.
Why it matters
Solving the memory bandwidth bottleneck is critical for scaling up large language models to handle longer contexts without performance degradation.
Key Points
- Memory bandwidth, not compute, is the bottleneck for long-context LLM inference
- Existing approaches like Top-K attention reduce KV reads, but the block-selection step itself still scales as O(n)
- PRISM uses photonic circuits to perform O(1) KV cache block selection, cutting memory traffic by 16x
- PRISM achieves a 10,000x energy-efficiency improvement in block selection with no accuracy loss
Details
The core problem is that every decode step in a Transformer model must scan the entire key-value (KV) cache to generate a single output token. This O(n) memory access pattern creates a fundamental bottleneck that does not improve even as GPU compute power increases. Existing solutions like Top-K attention reduce the number of KV cache entries read, but the block-selection process itself remains O(n). PRISM offloads this block selection to photonic circuits, making it O(1). The result is a 16x reduction in memory traffic at 64K-token contexts and a 10,000x improvement in the energy efficiency of block selection, with no loss of accuracy.
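To make the bottleneck concrete, here is a minimal sketch of block-wise Top-K selection, the step PRISM moves into photonics. All names, shapes, and the block-summary scheme are illustrative assumptions, not details from the paper:

```python
import numpy as np

def select_top_k_blocks(query, block_summaries, k):
    """Score every KV-cache block against the query and keep the top-k.

    query:           (d,)   current decode-step query vector
    block_summaries: (n, d) one summary vector (e.g. mean key) per cached block
    k:               number of blocks whose full KV entries will be fetched
    """
    scores = block_summaries @ query       # O(n) on a GPU -- the step PRISM makes O(1)
    return np.argsort(scores)[-k:][::-1]   # indices of the k highest-scoring blocks

rng = np.random.default_rng(0)
d, n_blocks, k = 64, 1024, 16              # e.g. 64K tokens in 64-token blocks
q = rng.standard_normal(d)
summaries = rng.standard_normal((n_blocks, d))
chosen = select_top_k_blocks(q, summaries, k)
# Only 16 of 1024 blocks are fetched from memory, but computing `scores`
# still touched all 1024 summaries -- the residual O(n) cost in the electronic baseline.
```

Even in this toy version, the selection itself reads every block summary, which is why accelerating the selection step, rather than just shrinking the subsequent KV reads, is where the claimed gains come from.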