Spent weekend tuning LLM server to hone my nerdism
The author spent a weekend setting up a local AI server with various large language models (LLMs) for chat and coding tasks, replacing Ollama with llama.cpp and squeezing maximum performance out of their hardware (dual RTX 3090 GPUs plus CPU).
Why it matters
This article provides a detailed technical reference for setting up and optimizing a local LLM server, which can be useful for developers and researchers working with large language models.
Key Points
- Set up a local AI server with llama.cpp and several LLMs
- Optimized configurations for maximum performance on dual RTX 3090 GPUs plus CPU (see the example entry after this list)
- Shared the llama-swap configuration as a reference for others
- Tested models including Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6
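As a rough illustration of what one of these tuned entries can look like, here is a minimal sketch of a single llama-swap model definition that splits a quantized model across two RTX 3090s. The model name, file path, quantization, and flag values are placeholders rather than the author's actual settings, and exact flag behaviour varies between llama.cpp builds.

```yaml
# Hypothetical llama-swap entry; path, context size, and quant are placeholders.
# --n-gpu-layers 99       offload all layers to the GPUs
# --tensor-split 1,1      split the weights evenly across the two 3090s
# --ctx-size              the main lever for trading context length against VRAM
# --cache-type-k/-v q8_0  quantize the KV cache so a larger context still fits
models:
  "qwen3-coder":
    cmd: |
      llama-server
      --port ${PORT}
      --model /models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
      --n-gpu-layers 99
      --tensor-split 1,1
      --ctx-size 65536
      --cache-type-k q8_0
      --cache-type-v q8_0
```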
Details
The author describes setting up a local AI server that replaces Ollama with llama.cpp and squeezes as much performance as possible out of their hardware (dual RTX 3090 GPUs plus CPU). They share their full llama-swap configuration file, with the specific llama-server commands and options used for each model, including Seed-OSS, Qwen3-Coder, Devstral, Nemotron-3, GPT-OSS, GLM-4.5, and GLM-4.6. The goal was to optimize performance and context size so each model fits within the 48GB of VRAM available across the two RTX 3090s, and the configuration is offered as a reference for anyone who wants to set up a similar local LLM server.
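To give a sense of how the full file is organised, the outline below sketches the general shape of a llama-swap configuration that registers several models and swaps between them on demand. The keys shown (healthCheckTimeout, ttl, aliases) are standard llama-swap options, but the model names, commands, and values are illustrative and abbreviated, not the author's file.

```yaml
# Illustrative llama-swap layout; commands are abbreviated, values are made up.
healthCheckTimeout: 300        # seconds to wait for a freshly started server
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT} --model /models/gpt-oss-120b.gguf ...
    ttl: 600                   # unload after 10 minutes of inactivity
    aliases:
      - "chat-default"
  "glm-4.6":
    cmd: |
      llama-server --port ${PORT} --model /models/GLM-4.6-Q3_K_M.gguf ...
    aliases:
      - "coding-default"
```

When a request names a model that is not currently loaded, llama-swap stops the running llama-server instance and starts the matching one, which is what lets a single 48GB setup serve many models one at a time.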