Whisper Hallucination on Silence: Why Your Transcript Loops the Same Phrase
This article discusses 'hallucination' in automatic speech recognition (ASR) models like Whisper, where the model repeatedly outputs the same phrase when it encounters silence or background noise in the audio.
Why it matters
Addressing the hallucination issue is crucial for improving the reliability and accuracy of automatic speech recognition systems in real-world applications.
Key Points
- Silence or low-confidence audio segments cause the model to 'fill in' with the most recent phrase it recognized
- Background music or ambient noise also triggers the model to loop phrases, as it struggles with non-speech audio
- Whisper and similar models lack a built-in voice activity detection (VAD) mechanism to skip silent/background-only segments
Details
The article explains that when there is silence or very quiet audio, the model's audio embeddings are near-zero, and it tries to 'fill in' what it thinks should be there by looping the most recent phrase it recognized. Similarly, background music or ambient noise causes the model to see 'something happening' and attempt to match it to its speech-based training data, leading to phrase looping. The proper solution is to add voice activity detection (VAD) before transcription, which would skip silence and background-only segments entirely, preventing the hallucination issue. The article also mentions that models like Sarvam AI's Saarika/Saaras, which are purpose-built for Indian audio patterns, handle background noise and hallucination better than general models like Whisper.
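The VAD-before-transcription approach described above can be sketched with a simple energy threshold: frames whose RMS energy falls below a cutoff are treated as silence and never reach the transcriber, so the model has nothing to loop on. This is a minimal illustration, not Whisper's own pipeline; the function name `energy_vad`, the frame size, and the threshold are assumptions, and production systems typically use a trained detector (e.g. Silero VAD or WebRTC VAD) rather than raw energy.

```python
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=30, threshold=0.01):
    """Return (start, end) sample ranges whose RMS energy exceeds threshold.

    A minimal energy-based VAD sketch. Real deployments usually swap this
    for a trained model (Silero VAD, WebRTC VAD), which is far more robust
    to background music and ambient noise.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    voiced = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > threshold:
            start = i * frame_len
            # Merge with the previous segment when frames are contiguous.
            if voiced and voiced[-1][1] == start:
                voiced[-1] = (voiced[-1][0], start + frame_len)
            else:
                voiced.append((start, start + frame_len))
    return voiced

# Synthetic check: 1 s of silence followed by 1 s of a 440 Hz tone.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 440 * t)])
segments = energy_vad(audio, sr)
# Only the tone half is flagged as "speech"; the silent first second is
# skipped entirely, so it never reaches the ASR model to hallucinate on.
```

Only the audio inside the returned segments would then be passed to the ASR model, which removes the near-zero embeddings that trigger phrase looping in the first place.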