HotSwap: Routing LLM Subtasks by Cache Economics
This article proposes HotSwap, a pattern that keeps a persistent cached Claude session as the stateful backbone while offloading read-only exploration turns to a cheaper provider to reduce LLM API costs.
Why it matters
HotSwap cuts LLM API costs by combining prompt caching with model routing in a hybrid architecture: the expensive cached context stays with one provider while cheap, disposable exploration work is routed elsewhere.
Key Points
- HotSwap uses cache economics as the motivating insight for a hybrid architecture that keeps the primary session warm
- It has a guardrail mechanism that lets cheap models explore freely but prevents them from taking irreversible actions
- It uses a self-tuning model selector that promotes or demotes the exploration model based on observed outcomes
- It includes a cross-provider message format translation layer to provide a seamless conversation history
Details
HotSwap is a pattern that separates LLM usage into two channels by task type, with cache economics as the motivating reason: cached prompt reads are billed at a fraction of the price of fresh input tokens, so keeping one session's cache warm is valuable while exploratory churn is not. The cached backbone, a persistent Claude session, handles every turn that takes an action; the cheap sidecar, an OpenAI model, handles exploration turns. To make good exploration decisions, the sidecar receives the full message history translated into OpenAI's format. A guardrail lets the sidecar explore freely but blocks it from irreversible actions, and a self-tuning selector promotes or demotes the exploration model based on observed outcomes.