Optimizing NVLink for LLaMA Inference on Quad V100 GPUs

This article recounts the author's experience setting up a server with four NVIDIA V100 GPUs and tuning llama.cpp inference to take advantage of the NVLink interconnect.

💡 Why it matters

This article offers practical guidance on getting better inference performance from large language models like LLaMA on high-end multi-GPU hardware, which matters for real-world deployment.

Key Points

  • The author bought a server with four NVIDIA V100 GPUs and had to hack the power supply to get it working
  • The default 'row' split mode in llama.cpp is not optimized for NVLink, resulting in lower performance
  • Switching to the 'layer' split mode significantly improved inference speed, from about 70 tokens/s to about 1,683 tokens/s (see the sketch after this list)
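
The split mode is just a command-line flag in llama.cpp, so the two modes are easy to compare on the same model. Below is a minimal, hypothetical sketch, not taken from the article: the binary name `llama-cli`, the model path, and the even four-way tensor split are assumptions for a recent CUDA build of llama.cpp.

```python
# Hypothetical sketch: compare llama.cpp split modes on a 4-GPU box.
# Assumes a CUDA build of llama.cpp whose CLI binary is "llama-cli"
# and a local GGUF model; adjust both for your setup.
import subprocess

MODEL = "models/llama-70b-q4_k_m.gguf"  # placeholder path

def run(split_mode: str) -> None:
    """Run a short generation with the given --split-mode and print llama.cpp's timings."""
    cmd = [
        "./llama-cli",
        "-m", MODEL,
        "-ngl", "99",                 # offload all layers to the GPUs
        "--split-mode", split_mode,   # "layer" or "row"
        "--tensor-split", "1,1,1,1",  # spread the model evenly over 4 GPUs
        "-p", "Hello",
        "-n", "128",
    ]
    print(f"--- split-mode={split_mode} ---")
    subprocess.run(cmd, check=True)

for mode in ("row", "layer"):
    run(mode)
```

Roughly speaking, 'row' distributes each weight matrix across the GPUs and so needs inter-GPU traffic inside every layer, while 'layer' keeps whole layers on a single GPU and only passes small activations between cards, which is why it fares better when the NVLink path is not being used efficiently.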

Details

The article describes the author's journey setting up a server with four NVIDIA V100 GPUs, each with 32 GB of memory, which required hacking the power supply to get the system running. After a year of experimenting with other GPU configurations, the author returned to the V100 setup and began tuning llama.cpp inference on it. They found that the default 'row' split mode was not optimized for NVLink, yielding only around 70 tokens/s; switching to the 'layer' split mode raised throughput to around 1,683 tokens/s. The author suggests that while the V100 setup is still expensive, it could be worth exploring if you can find cheap adapters for the 16 GB V100 SXM2 GPUs, and hopes that someone will eventually optimize llama.cpp's NVLink support for even better inference performance.
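
Before chasing split-mode settings, it is also worth confirming that the GPUs really are connected over NVLink rather than plain PCIe. The sketch below uses standard nvidia-smi subcommands and is not something the article itself shows; flag spellings can vary slightly across driver versions.

```python
# Quick NVLink sanity check before benchmarking (assumes the NVIDIA driver
# and nvidia-smi are installed).
import subprocess

# "nvidia-smi topo -m" prints the interconnect matrix: NVLink-connected GPU
# pairs show up as NV1/NV2/..., while PIX/PHB/SYS indicate PCIe-only paths.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True, check=True).stdout)

# Per-link NVLink status for GPU 0; repeat with -i 1..3 for the other cards.
print(subprocess.run(["nvidia-smi", "nvlink", "-s", "-i", "0"],
                     capture_output=True, text=True, check=True).stdout)
```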
