Dev.to · Machine Learning · 5d ago | Research & Papers · Products & Services

Whisper Hallucination on Silence: Why Your Transcript Loops the Same Phrase

This article discusses 'hallucination' in automatic speech recognition (ASR) models such as Whisper: when the model encounters silence or background noise in the audio, it can repeatedly output the same phrase in the transcript.


Why it matters

Addressing the hallucination issue is crucial for improving the reliability and accuracy of automatic speech recognition systems in real-world applications.

Key Points

  • Silence or low-confidence audio segments cause the model to 'fill in' the gap with the most recent phrase it recognized
  • Background music or ambient noise also triggers phrase looping, as the model struggles to map non-speech audio to its speech-based training data
  • Whisper and similar models lack a built-in voice activity detection (VAD) stage to skip silent or background-only segments
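The idea behind a VAD pre-pass can be illustrated with a minimal energy-based gate. This is a sketch, not a production VAD (dedicated models such as Silero VAD or WebRTC VAD are far more robust against background music), and the function name `find_speech_segments` and its thresholds are assumptions for illustration:

```python
import numpy as np

def find_speech_segments(audio, sample_rate=16000, frame_ms=30, rms_threshold=0.01):
    """Return (start_sec, end_sec) spans whose RMS energy exceeds the threshold.

    A naive energy gate: frames quieter than `rms_threshold` are treated as
    silence and never reach the transcription model, so there is nothing
    for it to 'fill in' with a looped phrase.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    segments, start = [], None
    for i in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[i:i + frame_len]
        rms = float(np.sqrt(np.mean(frame ** 2)))
        t = i / sample_rate
        if rms > rms_threshold and start is None:
            start = t                      # speech begins
        elif rms <= rms_threshold and start is not None:
            segments.append((start, t))    # speech ends
            start = None
    if start is not None:
        segments.append((start, len(audio) / sample_rate))
    return segments
```

A pure-energy gate will still pass loud background music through, which is exactly why the article recommends a real VAD model rather than a volume threshold.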

Details

The article explains that when there is silence or very quiet audio, the model's audio embeddings are near-zero, and it tries to 'fill in' what it thinks should be there by looping the most recent phrase it recognized. Background music or ambient noise has a similar effect: the model sees 'something happening' and attempts to match it to its speech-based training data, again producing looped phrases. The proper fix is to run voice activity detection (VAD) before transcription, so that silent and background-only segments are skipped entirely and never reach the model. The article also notes that models like Sarvam AI's Saarika/Saaras, which are purpose-built for Indian audio patterns, handle background noise and hallucination better than general-purpose models like Whisper.
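The VAD-then-transcribe pipeline described above can be sketched as glue code that feeds only the detected speech spans to a transcription backend. The function name `transcribe_speech_only` and the callable-based interface are assumptions for illustration; `transcribe` could be a thin wrapper around any ASR call (for example, Whisper's own transcribe method), which keeps the silence-skipping logic independent of the model:

```python
def transcribe_speech_only(audio, segments, transcribe, sample_rate=16000):
    """Transcribe only the given (start_sec, end_sec) speech spans.

    `audio` is a waveform indexable by sample; `segments` comes from a VAD
    pass; `transcribe` is any callable mapping a waveform chunk to text.
    Silent spans between segments are never sent to the model, so it has
    no opportunity to hallucinate a looped phrase over them.
    """
    parts = []
    for start, end in segments:
        chunk = audio[int(start * sample_rate):int(end * sample_rate)]
        parts.append(transcribe(chunk))
    return " ".join(parts)
```

Keeping the VAD and the model decoupled like this also makes it easy to swap Whisper for a purpose-built model without touching the silence handling.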

