Longer Contexts are Easier to Compress, Not Harder
Experiments show that longer input sequences are easier to compress than shorter ones, contrary to the common assumption. Longer contexts give the importance scorer more evidence to identify and evict less relevant tokens.
Why it matters
This finding has significant implications for efficient LLM inference, as it shows that longer contexts can be compressed more effectively than previously thought.
Key Points
- Longer input sequences (1,600 tokens) show significantly less quality degradation from compression than shorter sequences (500 tokens)
- Longer contexts allow the importance scorer to better distinguish relevant from irrelevant tokens, enabling safer eviction
- Production LLM inference typically uses thousands of tokens, so short-context benchmarks understate the compression quality of eviction-based methods
Details
The article presents experiments showing that longer input sequences are easier to compress than shorter ones, contrary to the common assumption. Using the same model and compression method, the authors found that at a 60% eviction rate, a 500-token input had a 4.5% increase in perplexity, while a 1,600-token input had only a 0.82% increase. This is because the importance scorer can better identify and evict less relevant tokens in longer contexts, as it has more query positions to aggregate attention over. With more data, the attention distribution becomes sharper, allowing the scorer to more confidently separate signal from noise. The authors' NexusQuant library achieves 10x compression at 500 tokens, 17x at 1,600 tokens, and 33x at any context length with low perplexity degradation. This suggests that production LLM inference, which typically uses thousands of tokens, can be more aggressive with compression than short-context benchmarks indicate.
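The aggregation mechanism described above can be sketched as follows. This is a minimal illustration, not NexusQuant's actual API: `importance_scores` and `evict` are hypothetical names, and the scorer here simply averages the attention each cached key token receives across all query positions, then drops the lowest-scoring fraction. With more query positions to average over, these per-key scores become more stable, which is the intuition behind longer contexts tolerating higher eviction rates.

```python
import numpy as np

def importance_scores(attn):
    """Score each cached key token by the mean attention it receives
    across all query positions.

    attn: array of shape [num_queries, num_keys], each row summing to 1.
    More query positions means each key's score is averaged over more
    samples, so the scores are less noisy.
    """
    return attn.mean(axis=0)

def evict(attn, eviction_rate):
    """Return indices of keys to KEEP after evicting the lowest-scoring
    fraction of cached tokens (illustrative, not the library's method)."""
    num_keys = attn.shape[1]
    num_keep = num_keys - int(eviction_rate * num_keys)
    scores = importance_scores(attn)
    # Keep the highest-scoring keys, preserving their original order.
    return np.sort(np.argsort(scores)[-num_keep:])

# Toy example: 8 query positions attending over 10 cached key tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 10))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

kept = evict(attn, eviction_rate=0.6)  # evict 60% of the cache
print(len(kept))  # 4 keys survive out of 10
```

In a real system the scores would come from the model's own attention maps rather than random logits, and eviction would be applied per layer and head; the sketch only shows why aggregating over more query positions sharpens the relevant/irrelevant separation.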