Spent weekend tuning LLM server to hone my nerdism

The author spent a weekend setting up a local AI server with several large language models (LLMs) for chat and coding tasks, replacing Ollama with llama.cpp and squeezing maximum performance out of their hardware (dual RTX 3090 + CPU).

💡 Why it matters

This article provides a detailed technical reference for setting up and optimizing a local LLM server, which can be useful for developers and researchers working with large language models.

Key Points

  1. Set up a local AI server with llama.cpp and a range of LLMs
  2. Optimized configurations for maximum performance on dual RTX 3090 GPUs + CPU
  3. Shared the llama-swap configuration as a reference for others (a hedged sketch follows this list)
  4. Tested models such as Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6
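For readers unfamiliar with llama-swap, the snippet below gives a rough idea of what such a configuration looks like: llama-swap proxies OpenAI-compatible requests and starts or stops the matching llama-server process on demand, substituting ${PORT} into each command. The model names, file paths, and flag values here are illustrative assumptions, not the author's actual settings; refer to the original post for their real configuration.

```yaml
# Minimal llama-swap sketch (hypothetical values).
healthCheckTimeout: 120

models:
  # A coding model assumed to fit entirely in VRAM at this quantization.
  "qwen3-coder":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
      -ngl 99 -c 65536
    ttl: 300

  # A smaller chat model kept loaded a bit longer before being swapped out.
  "gpt-oss-20b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gpt-oss-20b-Q4_K_M.gguf
      -ngl 99 -c 32768
    ttl: 600
```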

Details

The author describes setting up a local AI server with various large language models (LLMs), replacing Ollama with llama.cpp to squeeze as much performance as possible from their high-end hardware (dual RTX 3090 GPUs + CPU). They provide a detailed llama-swap configuration file with the specific commands and options used for each model, including Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6. The goal was to balance performance and context size so that each model fits within the 48 GB of VRAM available across the two RTX 3090s, and the configuration is shared as a reference for anyone building a similar local LLM server.
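To make the 48 GB constraint concrete, the entry below sketches one way a larger mixture-of-experts model might be configured under the same `models:` section as the earlier sketch: weights split across both GPUs with --tensor-split, a reduced context size to keep the KV cache in budget, and (in recent llama.cpp builds) some expert weights left in system RAM via --n-cpu-moe. All values are assumptions for illustration; the original post's configuration is the actual reference.

```yaml
  # Hypothetical entry for a model too large to fit in 48 GB of VRAM on its own.
  # --tensor-split 1,1 spreads offloaded layers evenly across both RTX 3090s,
  # -c 32768 trades context length for VRAM headroom, and --n-cpu-moe keeps
  # the expert tensors of the first N layers on the CPU.
  "glm-4.5-air":
    cmd: >
      llama-server --port ${PORT}
      -m /models/GLM-4.5-Air-Q4_K_M.gguf
      -ngl 99
      --tensor-split 1,1
      --n-cpu-moe 12
      -c 32768
    ttl: 600
```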
