Building a Roguelike RPG with On-Device AI
The author optimized an on-device AI model to generate content for their roguelike RPG, achieving a 79x performance improvement over the initial approach.
Why it matters
Optimizing on-device AI inference is critical for building responsive, offline-capable applications like this roguelike RPG.
Key Points
- Tested different quantization levels on the Adreno OpenCL backend, finding Q8_0 to be the optimal setting
- Swapped to a smaller 1.7B-parameter model (Qwen3-1.7B), which doubled generation speed to 16.6 tokens/second
- The smaller model also followed prompts more faithfully, generating appropriate mob names instead of placeholders
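The placeholder problem mentioned in the last point can be caught mechanically before generated names reach the game. Here is a minimal sketch of such a filter; the regex patterns and function names are illustrative assumptions, not the author's actual validation logic:

```python
import re

# Patterns suggesting the model emitted an unfilled template rather than a
# real mob name, e.g. "[NAME]", "<creature>", or "mob_1". These patterns
# are guesses for illustration, not the author's actual rules.
_PLACEHOLDER_PATTERNS = [
    re.compile(r"^\[.*\]$"),         # bracketed template slot: "[NAME]"
    re.compile(r"^<.*>$"),           # angle-bracket slot: "<creature>"
    re.compile(r"(?i)placeholder"),  # literal "placeholder"
    re.compile(r"^\w+_\d+$"),        # generic ids like "mob_1"
]

def is_placeholder(name: str) -> bool:
    """Return True if a generated mob name looks like an unfilled template."""
    name = name.strip()
    return not name or any(p.search(name) for p in _PLACEHOLDER_PATTERNS)

def filter_mob_names(names: list[str]) -> list[str]:
    """Keep only usable names from a batch of model generations."""
    return [n.strip() for n in names if not is_placeholder(n)]
```

A filter like this lets the game fall back to a retry or a hand-written name instead of showing a template string to the player.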
Details
The author started by using ONNX Runtime on the CPU, which was very slow at 0.21 tokens/second. They then tried ONNX Runtime with Qualcomm's QNN HTP, which improved it to 0.31 tokens/second. Next, they switched to the llama.cpp library with Adreno OpenCL, which boosted performance to 9.0 tokens/second using the larger Phi-4-mini model. Finally, by swapping to the smaller 1.7B parameter Qwen3-1.7B model, they achieved 16.6 tokens/second, a 79x improvement over the initial approach. This on-device AI stack can generate content for the game's dungeon loading screens in just 9 seconds.
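The headline 79x figure follows directly from the throughput numbers above (0.21 tokens/second at the start versus 16.6 at the end). A quick check of the arithmetic, using only the figures reported in the write-up:

```python
# Throughput at each stage of the optimization, in tokens/second,
# as reported in the write-up.
stages = {
    "ONNX Runtime (CPU)": 0.21,
    "ONNX Runtime (QNN HTP)": 0.31,
    "llama.cpp + Adreno OpenCL (Phi-4-mini)": 9.0,
    "llama.cpp + Adreno OpenCL (Qwen3-1.7B)": 16.6,
}

baseline = stages["ONNX Runtime (CPU)"]
for name, tps in stages.items():
    print(f"{name}: {tps} tok/s ({tps / baseline:.0f}x over CPU baseline)")
```

The final stage works out to 16.6 / 0.21 ≈ 79x, matching the claimed improvement.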