Building a Voice-Controlled Local AI Agent with Streamlit, Local STT, and Safe Tool Execution
The author built a voice-controlled local AI agent using Streamlit, local speech-to-text, and a safe tool execution layer. The system accepts audio input, converts speech to text, understands user intent, and executes local tools in a clean UI.
Why it matters
This project demonstrates an end-to-end AI application that combines speech processing, intent understanding, and safe local tool execution in a transparent and user-friendly manner.
Key Points
- Uses a local Hugging Face speech-to-text model with an API fallback
- Supports multiple intent-planning backends, including a local rules-based planner and an Ollama LLM
- Implements safe file operations restricted to a dedicated output folder
- Streamlit UI shows the transcription, planned actions, and final output
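The local-first STT with API fallback can be sketched as a generic try-local-then-remote pattern. This is a minimal sketch, not the author's exact code: `local_backend` and `remote_backend` are hypothetical callables (e.g. a local Hugging Face ASR pipeline and an OpenAI API wrapper) that take an audio path and return a transcript.

```python
def transcribe(audio_path, local_backend, remote_backend):
    """Try the local speech-to-text backend first; fall back to the remote API.

    Both backends are assumed to take a file path and return a transcript
    string. On weaker hardware the local model may fail to load or run,
    which is exactly the case the fallback covers.
    """
    try:
        return local_backend(audio_path), "local"
    except Exception:
        # Local model unavailable or too heavy: use the hosted API instead.
        return remote_backend(audio_path), "api"

# Usage with stub backends simulating a machine that cannot run the model:
def failing_local(path):
    raise RuntimeError("model too large for this machine")

def stub_api(path):
    return "create a file called notes.txt"

text, source = transcribe("clip.wav", failing_local, stub_api)
```

Keeping the backends as plain callables makes the fallback policy easy to test without loading any model.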
Details
The project follows a local-first design, allowing users to either record audio or upload an existing file. The speech-to-text layer uses a local Hugging Face model by default, with an OpenAI API fallback option for weaker hardware. The intent planning module supports multiple backends, including a lightweight local rules-based planner and a stronger Ollama LLM. The safe tool execution layer maps intents to specific actions like file creation, code generation, and text summarization, all restricted to a dedicated output folder. The Streamlit UI displays the full pipeline, showing transcription, planned steps, and final results. Key challenges included balancing capability and reliability, as well as ensuring safe file operations.
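A lightweight rules-based planner of the kind described is often just a keyword-to-intent table scanned against the transcript. The rules and intent names below are illustrative assumptions, not the author's actual ruleset:

```python
# Hypothetical keyword -> intent rules; a real planner would be richer.
RULES = [
    ("summarize", "summarize_text"),
    ("write code", "generate_code"),
    ("create a file", "create_file"),
]

def plan(transcript: str) -> list[str]:
    """Map a transcript to an ordered list of planned actions via keyword rules."""
    text = transcript.lower()
    actions = [intent for keyword, intent in RULES if keyword in text]
    return actions or ["unknown"]

print(plan("Please create a file with my meeting notes"))  # -> ['create_file']
```

Because the planner is a pure function from text to action names, a stronger backend (here, an Ollama LLM) can be swapped in behind the same interface.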
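Restricting tool execution to a dedicated output folder typically means resolving every requested path and rejecting anything that escapes the sandbox. A minimal sketch of that check, with the folder name `agent_output` as an assumption:

```python
from pathlib import Path

OUTPUT_DIR = Path("agent_output").resolve()

def safe_write(relative_path: str, content: str) -> Path:
    """Write only inside OUTPUT_DIR; block '../'-style path escapes."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / relative_path).resolve()
    # After resolving symlinks and '..', the target must still sit under OUTPUT_DIR.
    if OUTPUT_DIR not in target.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}")
    target.write_text(content, encoding="utf-8")
    return target

safe_write("notes.txt", "hello")        # allowed
# safe_write("../escape.txt", "nope")   # raises ValueError
```

Resolving the path *before* the containment check is the important detail: checking the raw string would let `"../escape.txt"` slip through.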