Spent weekend tuning LLM server to hone my nerdism
The author spent a weekend setting up a local AI server with various large language models (LLMs) for chat and coding tasks, replacing Ollama with llama.cpp and squeezing maximum performance out of their hardware (dual RTX 3090 GPUs plus CPU).
Why it matters
This article provides a detailed technical reference for setting up and optimizing a local LLM server, which can be useful for developers and researchers working with large language models.
Key Points
- Set up a local AI server with llama.cpp and several LLMs
- Optimized configurations for maximum performance on dual RTX 3090 GPUs plus CPU (see the example entry after this list)
- Shared the llama-swap configuration as a reference for others
- Tested models including Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6
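As a rough illustration of what one of these tuned entries can look like, here is a minimal sketch of a single llama-swap model definition that splits a quantized model across two RTX 3090s. The model name, file path, quantization, and flag values are placeholders rather than the author's actual settings, and exact flag behaviour varies between llama.cpp builds.

```yaml
# Hypothetical llama-swap entry; path, context size, and quant are placeholders.
# --n-gpu-layers 99       offload all layers to the GPUs
# --tensor-split 1,1      split the weights evenly across the two 3090s
# --ctx-size              the main lever for trading context length against VRAM
# --cache-type-k/-v q8_0  quantize the KV cache so a larger context still fits
models:
  "qwen3-coder":
    cmd: |
      llama-server
      --port ${PORT}
      --model /models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
      --n-gpu-layers 99
      --tensor-split 1,1
      --ctx-size 65536
      --cache-type-k q8_0
      --cache-type-v q8_0
```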
Details
The author describes setting up a local AI server that replaces Ollama with llama.cpp and squeezes as much performance as possible out of their hardware (dual RTX 3090 GPUs plus CPU). They share their full llama-swap configuration file, with the specific llama-server commands and options used for each model, including Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6. The goal was to optimize performance and context size so each model fits within the 48GB of VRAM available across the two RTX 3090s, and the configuration is offered as a reference for anyone who wants to set up a similar local LLM server.
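To give a sense of how the full file is organised, the outline below sketches the general shape of a llama-swap configuration that registers several models and swaps between them on demand. The keys shown (healthCheckTimeout, ttl, aliases) are standard llama-swap options, but the model names, commands, and values are illustrative and abbreviated, not the author's file.

```yaml
# Illustrative llama-swap layout; commands are abbreviated, values are made up.
healthCheckTimeout: 300        # seconds to wait for a freshly started server
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT} --model /models/gpt-oss-120b.gguf ...
    ttl: 600                   # unload after 10 minutes of inactivity
    aliases:
      - "chat-default"
  "glm-4.6":
    cmd: |
      llama-server --port ${PORT} --model /models/GLM-4.6-Q3_K_M.gguf ...
    aliases:
      - "coding-default"
```

When a request names a model that is not currently loaded, llama-swap stops the running llama-server instance and starts the matching one, which is what lets a single 48GB setup serve many models one at a time.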