New local realistic and emotional TTS with speeds up to 100x realtime: MiraTTS

The author has open-sourced MiraTTS, a fast and high-quality text-to-speech model that can generate realistic 48kHz audio at up to 100x realtime speed with low latency and low VRAM usage.

Why it matters

MiraTTS combines 48kHz output, latency as low as 150ms, and roughly 6GB of VRAM in a model that runs locally, a combination that could enable new real-time voice applications.

Key Points

  1. MiraTTS is an open-source TTS model that generates realistic 48kHz audio at up to 100x realtime speed.
  2. Its latency can be as low as 150ms, which makes it suitable for real-time streaming and voice agents.
  3. It uses about 6GB of VRAM, so it can run on lower-end hardware.
  4. The author plans to release the training code and to experiment with multilingual and multi-speaker versions.

Details

MiraTTS is a text-to-speech (TTS) model, developed and open-sourced by the author, that generates 48kHz audio at up to 100x realtime; that is, at peak speed one second of compute yields up to 100 seconds of speech. By the author's account, other local TTS models typically output lower-quality 16kHz or 24kHz audio. Latency can be as low as 150ms, which makes the model suitable for real-time applications such as voice agents and streaming. It needs about 6GB of VRAM, so it can run on lower-end hardware. The author plans to release the training code and to experiment with multilingual and multi-speaker versions.
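To make the "100x realtime" and sample-rate figures concrete, here is a minimal arithmetic sketch. It does not call MiraTTS or any TTS library; the function names are illustrative assumptions, and the 100x figure is the upper bound quoted above, so actual speed will depend on hardware and settings.

```python
# Illustrative arithmetic only: these helpers are hypothetical and are not
# part of MiraTTS or any other library.

def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (100.0 means 100x realtime)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def generation_time(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to synthesize a clip at a given realtime speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def sample_count(audio_seconds: float, sample_rate_hz: int) -> int:
    """Number of mono samples in a clip at the given sample rate."""
    return round(audio_seconds * sample_rate_hz)


if __name__ == "__main__":
    clip = 60.0  # one minute of speech

    # At the quoted upper bound of 100x realtime:
    print(f"compute time at 100x: {generation_time(clip, 100.0):.2f} s")

    # Sample counts at the rates mentioned in the summary:
    for rate in (16_000, 24_000, 48_000):
        print(f"{rate} Hz -> {sample_count(clip, rate)} samples")
```

Under these assumptions, a one-minute clip takes about 0.6 seconds to generate at 100x realtime, and at 48kHz it contains 2,880,000 samples, three times as many as the same clip at 16kHz.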

AI Curator - Daily AI News Curation
