If Memory Could Compute, Would We Still Need GPUs?
The bottleneck for large language model (LLM) inference is not GPU compute, but memory bandwidth. Processing-in-Memory (PIM) architectures aim to address this by computing where the data lives, reducing data movement.
Why it matters
PIM architectures have the potential to dramatically improve the efficiency of large language model inference, which is currently limited by memory bandwidth rather than compute power.
Key Points
- LLM inference has two phases: prefill (compute-bound) and decode (memory bandwidth-bound)
- GPUs spend most of their time idle during the decode phase, waiting for data to arrive from memory
- PIM architectures like SK Hynix's AiM and Samsung's LPDDR5X-PIM integrate compute units into memory to ease the memory bandwidth bottleneck
- Upcoming HBM4 memory will integrate logic dies, turning the memory stack itself into a co-processor
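The compute-bound vs. memory-bound distinction above can be checked with a back-of-envelope roofline calculation. The sketch below uses illustrative assumptions (a 70B-parameter fp16 model, and roughly 1000 TFLOP/s fp16 against ~3.35 TB/s of HBM bandwidth for the GPU); none of these figures come from the article itself.

```python
# Roofline check: is LLM decode compute-bound or bandwidth-bound?
# All numbers are illustrative assumptions, not measured specs.

PARAMS = 70e9          # assumed model size: 70B parameters
BYTES_PER_PARAM = 2    # fp16 weights
FLOPS_PER_PARAM = 2    # one multiply-accumulate per weight per token

# Decode generates one token at a time, so every weight is read once per token.
bytes_moved = PARAMS * BYTES_PER_PARAM
flops = PARAMS * FLOPS_PER_PARAM
arithmetic_intensity = flops / bytes_moved   # FLOPs per byte moved

# Assumed GPU balance point: compute rate / memory bandwidth.
gpu_balance = 1000e12 / 3.35e12              # ~300 FLOP/byte

print(f"decode arithmetic intensity: {arithmetic_intensity:.1f} FLOP/byte")
print(f"GPU balance point:           {gpu_balance:.0f} FLOP/byte")
print("decode is memory-bound" if arithmetic_intensity < gpu_balance
      else "decode is compute-bound")
```

With these assumptions, decode performs about 1 FLOP per byte moved while the GPU would need hundreds of FLOPs per byte to stay busy, which is why the arithmetic units sit idle waiting on memory.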
Details
The core idea behind PIM is to compute where the data lives, eliminating the need to move data back and forth between memory and compute units. This addresses the 'memory wall' problem, where GPU arithmetic units sit idle for most of the LLM decode phase, waiting for data to arrive from memory. PIM architectures like SK Hynix's AiM and Samsung's LPDDR5X-PIM integrate compute units directly into the memory, providing orders of magnitude higher internal bandwidth compared to external bus bandwidth. Upcoming HBM4 memory will take this further by integrating logic dies into the memory stack, turning it into a co-processor. While the GPU era is not ending, PIM will significantly change the LLM inference architecture, reducing data movement and improving efficiency.
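The effect of moving compute into memory can be sketched the same way: if every decoded token must stream all the weights past the compute units, tokens per second is capped at bandwidth divided by model size. The numbers below are assumptions for illustration, including the 16x internal-bandwidth factor standing in for the article's "orders of magnitude" claim.

```python
# Bandwidth-limited decode throughput estimate (single stream, no batching).
# Illustrative assumptions only; real systems add KV-cache traffic and overlap.

MODEL_BYTES = 70e9 * 2              # assumed 70B fp16 model -> 140 GB of weights
EXTERNAL_BW = 3.35e12               # assumed external HBM bandwidth, bytes/s
PIM_INTERNAL_BW = 16 * EXTERNAL_BW  # assumed PIM internal-bandwidth advantage

# tokens/s is capped by how fast the weights can stream past the compute units
gpu_tokens_per_s = EXTERNAL_BW / MODEL_BYTES
pim_tokens_per_s = PIM_INTERNAL_BW / MODEL_BYTES

print(f"external-bandwidth-bound decode: ~{gpu_tokens_per_s:.0f} tokens/s")
print(f"PIM internal-bandwidth decode:   ~{pim_tokens_per_s:.0f} tokens/s")
```

Under these assumptions the ceiling scales linearly with bandwidth, which is the core argument for computing inside the memory stack rather than widening the external bus.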