Optimizing NVLink for LLaMA Inference on Quad V100 GPUs

This article recounts the author's experience setting up a server with four NVIDIA V100 GPUs and tuning llama.cpp inference to take advantage of the NVLink interconnect.

💡 Why it matters

This article offers practical guidance on getting better inference performance from large language models like LLaMA on high-end multi-GPU hardware, which matters for real-world deployment.

Key Points

  • The author bought a server with four NVIDIA V100 GPUs and had to hack the power supply to get it working
  • The default 'row' split mode in llama.cpp is not optimized for NVLink, resulting in lower performance
  • Switching to the 'layer' split mode significantly improved inference speed, from about 70 tokens/s to about 1,683 tokens/s (see the sketch after this list)
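
The split mode is just a command-line flag in llama.cpp, so the two modes are easy to compare on the same model. Below is a minimal, hypothetical sketch, not taken from the article: the binary name `llama-cli`, the model path, and the even four-way tensor split are assumptions for a recent CUDA build of llama.cpp.

```python
# Hypothetical sketch: compare llama.cpp split modes on a 4-GPU box.
# Assumes a CUDA build of llama.cpp whose CLI binary is "llama-cli"
# and a local GGUF model; adjust both for your setup.
import subprocess

MODEL = "models/llama-70b-q4_k_m.gguf"  # placeholder path

def run(split_mode: str) -> None:
    """Run a short generation with the given --split-mode and print llama.cpp's timings."""
    cmd = [
        "./llama-cli",
        "-m", MODEL,
        "-ngl", "99",                 # offload all layers to the GPUs
        "--split-mode", split_mode,   # "layer" or "row"
        "--tensor-split", "1,1,1,1",  # spread the model evenly over 4 GPUs
        "-p", "Hello",
        "-n", "128",
    ]
    print(f"--- split-mode={split_mode} ---")
    subprocess.run(cmd, check=True)

for mode in ("row", "layer"):
    run(mode)
```

Roughly speaking, 'row' distributes each weight matrix across the GPUs and so needs inter-GPU traffic inside every layer, while 'layer' keeps whole layers on a single GPU and only passes small activations between cards, which is why it fares better when the NVLink path is not being used efficiently.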

Details

The article describes the author's journey setting up a server with four NVIDIA V100 GPUs, each with 32 GB of memory, which required hacking the power supply to get the system running. After a year of experimenting with other GPU configurations, the author returned to the V100 setup and began tuning llama.cpp inference on it. They found that the default 'row' split mode was not optimized for NVLink, yielding only around 70 tokens/s; switching to the 'layer' split mode raised throughput to around 1,683 tokens/s. The author suggests that while the V100 setup is still expensive, it could be worth exploring if you can find cheap adapters for the 16 GB V100 SXM2 GPUs, and hopes that someone will eventually optimize llama.cpp's NVLink support for even better inference performance.
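
Before chasing split-mode settings, it is also worth confirming that the GPUs really are connected over NVLink rather than plain PCIe. The sketch below uses standard nvidia-smi subcommands and is not something the article itself shows; flag spellings can vary slightly across driver versions.

```python
# Quick NVLink sanity check before benchmarking (assumes the NVIDIA driver
# and nvidia-smi are installed).
import subprocess

# "nvidia-smi topo -m" prints the interconnect matrix: NVLink-connected GPU
# pairs show up as NV1/NV2/..., while PIX/PHB/SYS indicate PCIe-only paths.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout)

# Per-link NVLink status for GPU 0; repeat with -i 1..3 for the other cards.
print(subprocess.run(["nvidia-smi", "nvlink", "-s", "-i", "0"],
                     capture_output=True, text=True, check=True).stdout)
```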
