New local realistic and emotional TTS with speeds up to 100x realtime: MiraTTS
The author has open-sourced MiraTTS, a fast and high-quality text-to-speech model that can generate realistic 48kHz audio at up to 100x realtime speed with low latency and low VRAM usage.
Why it matters
MiraTTS combines high-quality output, low latency, and efficient local inference, which could enable new real-time voice applications.
Key Points
- MiraTTS is an open-source TTS model that can generate realistic 48kHz audio at up to 100x realtime speed
- It has low latency (as low as 150ms), making it suitable for real-time streaming and voice agents
- The model has low VRAM usage (6GB), so it can run on low-end devices
- The author plans to release training code and experiment with multilingual and multi-speaker versions
Details
MiraTTS is a text-to-speech (TTS) model developed by the author that generates high-quality 48kHz audio at up to 100x realtime. Other local TTS models typically generate lower-quality 16kHz or 24kHz audio. Latency can be as low as 150ms, which makes the model suitable for real-time applications such as voice agents and streaming. VRAM usage is 6GB, allowing the model to run on low-end devices. The author plans to release the training code and to experiment with multilingual and multi-speaker versions of the model.
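To make the "100x realtime" figure concrete, here is a minimal sketch of the arithmetic. It assumes the usual definition of realtime factor (seconds of audio produced per second of wall-clock compute); the post does not state how the figure was measured. The function names are illustrative and are not part of MiraTTS.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def synthesis_time(audio_seconds: float, factor: float) -> float:
    """Wall-clock time needed to produce audio_seconds at a given realtime factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


SAMPLE_RATE_HZ = 48_000  # output sample rate quoted in the post

# At the quoted peak of 100x realtime, one minute of audio takes 0.6 s to generate.
print(synthesis_time(60.0, 100.0))

# One minute of single-channel 48 kHz audio contains this many samples.
print(int(60.0 * SAMPLE_RATE_HZ))
```

Throughput and latency are separate measures: the realtime factor describes how fast audio is produced overall, while the 150ms figure describes the delay before audio becomes available.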