vLLM On-Demand Gateway: Zero-VRAM Standby for Local LLMs on Consumer GPUs
The article presents a solution to a common problem with running local large language models (LLMs): an inference server such as vLLM claims a large slice of GPU VRAM even when idle. The author built a FastAPI gateway that starts and stops the vLLM process on demand, freeing VRAM for other applications.
Why it matters
This solution helps address the resource management challenges of running local LLMs on consumer GPUs, enabling more efficient use of limited VRAM.
Key Points
- The gateway listens on port 8000 with near-zero VRAM usage
- It automatically starts the vLLM process on an internal port (8100) when a request arrives
- It automatically stops the vLLM process after 10 minutes of idle time, fully freeing the VRAM
- It rewrites tool calls from Nemotron's format to the OpenAI-compatible format
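The start/stop lifecycle behind the first three points can be sketched as a small manager class. This is a minimal illustration, not the author's actual code: the `cmd`, `idle_seconds`, and injectable `is_ready` health probe are all assumptions.

```python
import subprocess
import time


class OnDemandProcess:
    """Start a backend process on first use; stop it after an idle window.

    Hypothetical sketch of the gateway's lifecycle logic: cmd, idle_seconds,
    and is_ready are illustrative parameters, not the article's code.
    """

    def __init__(self, cmd, idle_seconds=600, is_ready=lambda: True):
        self.cmd = cmd
        self.idle_seconds = idle_seconds
        self.is_ready = is_ready
        self.proc = None
        self.last_used = 0.0

    def ensure_running(self, timeout=30.0):
        # Start the backend if it is not already alive.
        if self.proc is None or self.proc.poll() is not None:
            # New session, so the whole process group can be killed later.
            self.proc = subprocess.Popen(self.cmd, start_new_session=True)
            deadline = time.monotonic() + timeout
            # Poll the health probe until the backend is ready to serve.
            while not self.is_ready():
                if time.monotonic() > deadline:
                    raise TimeoutError("backend never became healthy")
                time.sleep(0.5)
        self.last_used = time.monotonic()

    def reap_if_idle(self):
        # Called periodically; stops the backend once the idle window passes.
        if self.proc is not None and self.proc.poll() is None:
            if time.monotonic() - self.last_used > self.idle_seconds:
                self.proc.terminate()
                self.proc.wait()
                self.proc = None
```

A request handler would call `ensure_running()` before proxying, while a background task calls `reap_if_idle()` on a timer.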
Details
The article describes the core problem: when serving a local LLM with vLLM on a consumer GPU, the server claims a large portion of the VRAM at startup and never releases it, even while idle. This starves other GPU-accelerated applications that need to share the limited VRAM.

The author's solution is a FastAPI gateway that manages the lifecycle of the vLLM process. The gateway listens on port 8000 with minimal VRAM usage and starts vLLM on an internal port (8100) when a request arrives; after 10 minutes of idle time it stops the process, fully freeing the VRAM. The gateway also rewrites tool calls from Nemotron's custom format to the OpenAI-compatible format, keeping the whole arrangement transparent to the client.

Key design decisions include killing the whole process group so no zombie processes are left behind, separating the gateway and vLLM ports to avoid conflicts, and polling the health check endpoint to ensure the vLLM process is ready before proxying requests.
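The process-group kill decision can be illustrated as follows. A server like vLLM spawns worker subprocesses, so terminating only the parent can leave orphans holding VRAM; launching with `start_new_session=True` puts the whole tree in its own process group, which `os.killpg()` can then terminate at once. The command below is a stand-in for vLLM, not the author's invocation.

```python
import os
import signal
import subprocess

# Stand-in for a vLLM server that would spawn its own workers.
proc = subprocess.Popen(
    ["python3", "-c", "import time; time.sleep(60)"],
    start_new_session=True,  # own process group => killable as a unit
)

# Kill the entire group, not just the lead process, so no workers survive.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()
```

After `wait()`, a negative return code confirms the process died from the signal rather than exiting on its own.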
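The tool-call rewriting step could look like the sketch below. The article does not show Nemotron's wire format, so the tag-wrapped JSON list (`<TOOLCALL>[...]</TOOLCALL>`) is an assumption for illustration; only the OpenAI-side target shape (`tool_calls` with a JSON-string `arguments` field) is standard.

```python
import json
import re
import uuid

# Assumed model output format for illustration:
#   <TOOLCALL>[{"name": "...", "arguments": {...}}]</TOOLCALL>
TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def rewrite_tool_calls(text):
    """Convert a tag-wrapped tool-call list into OpenAI-style tool_calls."""
    match = TOOLCALL_RE.search(text)
    if not match:
        return None  # plain completion, nothing to rewrite
    calls = json.loads(match.group(1))
    return [
        {
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI clients expect arguments as a JSON *string*.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        }
        for call in calls
    ]
```

Because the rewrite happens inside the proxy, clients see a standard OpenAI chat-completions response regardless of the model's native format.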