Gradient Accumulation vs Large Batch: Memory & Cost Test
The article explores the trade-offs between using gradient accumulation and large batch sizes for training deep learning models, focusing on memory usage and training costs.
Why it matters
Understanding the memory and cost implications of gradient accumulation versus large batch sizes is crucial for optimizing the training of deep learning models, especially on resource-constrained hardware.
Key Points
- Gradient accumulation can lead to unexpected memory issues, contrary to the common belief that it saves memory
- The article compares two training strategies on an A100 GPU: batch size 128 vs batch size 8 with gradient accumulation of 16 steps
- Both strategies have the same effective batch size, but the memory usage and training costs can differ significantly
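The equivalence claimed in the last two points is simple arithmetic: effective batch size is micro-batch size times accumulation steps. A quick sanity check with the article's numbers (variable names are illustrative, not from the article):

```python
# Effective batch size = micro-batch size * accumulation steps.
large_batch = 128                 # strategy 1: one big batch, no accumulation
micro_batch, accum_steps = 8, 16  # strategy 2: small batches, accumulated

effective = micro_batch * accum_steps
print(effective == large_batch)   # True: same effective batch size
```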
Details
The article discusses the common misconception that gradient accumulation can effectively increase the batch size without increasing memory usage. It presents a case study in which developers migrated from a batch size of 32 to gradient accumulation, expecting to save money, but instead hit out-of-memory (OOM) errors much earlier in training.

The article then compares two training strategies on an A100 GPU: one with a batch size of 128 and no gradient accumulation, and another with a batch size of 8 and gradient accumulation over 16 steps (an effective batch size of 128). The author provides real memory profiles and AWS cost data to show that the memory savings from gradient accumulation are not as straightforward as they may seem.

The article aims to highlight the edge cases and pitfalls that developers should be aware of when choosing between large batch sizes and gradient accumulation.
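To make the comparison concrete, here is a minimal pure-Python sketch (not the article's code) of why the two strategies are mathematically equivalent: accumulating mean gradients over micro-batches and averaging them reproduces the large-batch gradient. All names and the toy 1-D linear model are illustrative assumptions:

```python
import random

def grad(w, xs, ys):
    # Mean gradient of 0.5 * (w*x - y)^2 over a batch, for y = w * x.
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [random.random() for _ in range(128)]
ys = [2.0 * x for x in xs]
w = 0.0

# Strategy 1: one large batch of 128.
g_large = grad(w, xs, ys)

# Strategy 2: micro-batches of 8, accumulated over 16 steps, then averaged.
accum, steps = 0.0, 16
for i in range(steps):
    chunk_x = xs[i * 8:(i + 1) * 8]
    chunk_y = ys[i * 8:(i + 1) * 8]
    accum += grad(w, chunk_x, chunk_y)
g_accum = accum / steps

# Equal up to floating-point rounding: the gradients match, even though
# the memory profile of each strategy (activations held per step) differs.
print(abs(g_large - g_accum) < 1e-12)
```

The equivalence holds for the gradient itself; the article's point is that memory behavior diverges in practice, since weights, gradients, and optimizer state persist across accumulation steps while only activation memory shrinks with the micro-batch.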