Optimizing Whisper with Faster-Whisper and Pyannote 4.0
The author rebuilt their legacy ASR pipeline using Faster-Whisper and Pyannote 4.0, achieving significant performance improvements. They faced challenges with PyTorch 2.8, cuDNN 9, and API changes, but were able to optimize the speaker-to-word alignment algorithm to linear time complexity.
Why it matters
The author's work shows that a legacy ASR pipeline can be rebuilt on modern, maintained components for substantial speed and memory gains, which matters for anyone running transcription and diarization in production.
Key Points
- Rebuilt legacy ASR pipeline using Faster-Whisper and Pyannote 4.0
- Faced issues with PyTorch 2.8, cuDNN 9, and API changes
- Optimized speaker-to-word alignment algorithm to linear time complexity
- Achieved 30-second processing time for test files on RTX 4000 Ada GPU
Details
The author was running an old WhisperX setup that was starting to show its age: an abandoned repo, an outdated PyTorch, and memory leaks. They decided to rebuild the pipeline from scratch using Faster-Whisper (CTranslate2) and the new Pyannote 4.0.3 for diarization.

The rebuild was not smooth. They hit problems with PyTorch 2.8 and cuDNN 9 dependencies, breaking API changes in Pyannote 4.0, and dependency conflicts. To work around these, the author built the environment layer by layer in Docker, set explicit library paths, and rewrote the speaker-to-word alignment algorithm as a linear scan, O(N), instead of the original quadratic O(N*M) approach.

The result is a service that can now process audio (transcription, diarization, and alignment) in around 30 seconds for test files, using an RTX 4000 Ada GPU with around 4 GB of VRAM.