Dual-engine approach for detecting AI-generated music in compressed audio
The author explores a hybrid approach to detect AI-generated music, combining a CNN-based model and a source separation engine, to overcome the limitations of CNN-only models on compressed audio formats like MP3.
Why it matters
This hybrid approach offers a more robust solution for detecting AI-generated audio content, which is crucial as AI-generated media becomes more prevalent.
Key Points
- CNN-based detection on mel-spectrograms breaks when audio is compressed to MP3
- Combining a CNN model with a source separation engine (Demucs) achieves an 80%+ detection rate on AI-generated music
- The hybrid approach works regardless of audio codec (MP3, AAC, OGG) and saves compute by running the expensive source separation only when the CNN is uncertain
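The compute-saving gate described above could be sketched as a simple two-stage decision: trust the cheap CNN when its score is decisive, and invoke the costly separation engine only in the uncertain band. The thresholds and function names below are illustrative assumptions, not the author's actual values.

```python
def detect(cnn_score: float, separation_says_ai, low: float = 0.2, high: float = 0.8) -> str:
    """Two-stage gate: the cheap CNN score decides clear-cut cases;
    the expensive separation check runs only for borderline scores.

    cnn_score: CNN's probability that the audio is AI-generated.
    separation_says_ai: zero-argument callable running the costly
        separation-based check (e.g. Demucs remix comparison).
    low/high: hypothetical confidence thresholds.
    """
    if cnn_score >= high:
        return "ai"       # CNN is confident: skip separation entirely
    if cnn_score <= low:
        return "human"    # CNN is confident the other way
    # Borderline: pay for the separation-based second opinion
    return "ai" if separation_says_ai() else "human"
```

In this sketch the separation engine is passed as a callable so the gate itself stays cheap to test and the expensive model is only constructed or invoked when actually needed.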
Details
The author was working on detecting AI-generated music and hit the same issue as Deezer's team: CNN-based models trained on mel-spectrograms perform well on uncompressed WAV files but break down once the audio is compressed to MP3.

To address this, the author added a second engine based on source separation using the Demucs model. The idea is to separate the audio into 4 stems (vocals, drums, bass, other), remix them, and measure the difference between the original and reconstructed audio. For human-recorded music, the stems bleed into each other during recording, so the reconstruction produces noticeable differences. For AI-generated music, where each stem is synthesized independently, the reconstruction yields nearly identical results.

This hybrid approach achieved a human false positive rate of ~1.1% and an AI detection rate of 80%+, working across different audio codecs. The limitations include varying detection rates across different AI generators, non-deterministic behavior of Demucs in borderline cases, and the system only being tested on music (not speech or sound effects).
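The separate-remix-compare step can be illustrated with a small metric over waveforms. This is a minimal sketch assuming the stems arrive as NumPy arrays (in practice they would come from Demucs); the relative-RMS formulation and the name `reconstruction_score` are assumptions for illustration, not the author's exact measure.

```python
import numpy as np

def reconstruction_score(original: np.ndarray, stems: np.ndarray) -> float:
    """Remix separated stems and measure how closely they rebuild the original.

    original: mono waveform, shape (n_samples,).
    stems: separated sources, shape (n_stems, n_samples),
        e.g. vocals/drums/bass/other from a model like Demucs.
    Returns a relative RMS error: near 0 when the stems sum cleanly
    back to the original (as with independently synthesized AI stems),
    larger when recording bleed makes the separation lossy.
    """
    remix = stems.sum(axis=0)                       # reconstructed mix
    err = np.sqrt(np.mean((original - remix) ** 2)) # RMS residual
    scale = np.sqrt(np.mean(original ** 2)) + 1e-12 # normalize by signal level
    return float(err / scale)
```

A detector would then threshold this score: values near zero suggest AI-generated audio, larger values suggest a human recording with inter-stem bleed.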