Optimizing Whisper with Faster-Whisper and Pyannote 4.0
The author rebuilt their legacy ASR pipeline using Faster-Whisper and Pyannote 4.0, achieving significant performance improvements. They faced challenges with PyTorch 2.8, cuDNN 9, and API changes, but were able to optimize the speaker-to-word alignment algorithm to linear time complexity.
Why it matters
The author's work shows that a legacy ASR pipeline can be rebuilt on modern, maintained components for substantial speed and memory gains, which matters for anyone running transcription and diarization in production.
Key Points
- Rebuilt legacy ASR pipeline using Faster-Whisper and Pyannote 4.0
- Faced issues with PyTorch 2.8, cuDNN 9, and API changes
- Optimized speaker-to-word alignment algorithm to linear time complexity
- Achieved 30-second processing time for test files on RTX 4000 Ada GPU
Details
The author was running an old WhisperX setup that was starting to show its age: an abandoned repo, an outdated PyTorch, and memory leaks. They decided to rebuild the pipeline from scratch using Faster-Whisper (CTranslate2) and the new Pyannote 4.0.3 for diarization.

The rebuild was not smooth. They hit problems with PyTorch 2.8 and cuDNN 9 dependencies, breaking API changes in Pyannote 4.0, and dependency conflicts. To work around these, the author built the environment layer by layer in Docker, set explicit library paths, and rewrote the speaker-to-word alignment algorithm as a linear scan, O(N), instead of the original quadratic O(N*M) approach.

The result is a service that can now process audio (transcription, diarization, and alignment) in around 30 seconds for test files, using an RTX 4000 Ada GPU with around 4 GB of VRAM.