Remove openai-whisper dependency for log_mel_spectrogram by musselmanjoey · Pull Request #1846 · FunAudioLLM/CosyVoice

musselmanjoey · 2026-03-12T01:12:03Z

Summary

Replace all whisper.log_mel_spectrogram() calls with a lightweight implementation using torch + torchaudio (already required deps)
Add cosyvoice/utils/audio_utils.py with a drop-in log_mel_spectrogram() function
Remove openai-whisper==20231117 from requirements.txt
Move the legacy whisper.tokenizer.Tokenizer import (used only by CosyVoice v1's get_tokenizer()) to a lazy import so it doesn't break module loading

Motivation

openai-whisper is a ~1.5GB speech recognition package, but CosyVoice only uses one utility function from it: whisper.log_mel_spectrogram(). Requiring the package causes widespread installation failures due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab, etc.).

Related issues: #1844, #1266, #249, #316

Details

log_mel_spectrogram is a standard audio preprocessing operation (STFT → mel filterbank → log scaling). The replacement in audio_utils.py uses torch.stft and torchaudio.functional.melscale_fbanks with the same parameters as Whisper (n_fft=400, hop_length=160, 16kHz sample rate), producing numerically equivalent output.

Files changed:

cosyvoice/utils/audio_utils.py (new) — shared log_mel_spectrogram implementation
cosyvoice/cli/frontend.py — use audio_utils.log_mel_spectrogram instead of whisper
cosyvoice/dataset/processor.py — same replacement
tools/extract_speech_token.py — same replacement
cosyvoice/tokenizer/tokenizer.py — lazy import of whisper.tokenizer.Tokenizer (only needed for v1 tokenizer path)
requirements.txt — remove openai-whisper
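
The tokenizer change listed above is not shown in this description. A lazy import of this kind typically looks something like the sketch below. The helper name load_v1_tokenizer_class and the error message are hypothetical. Only the whisper.tokenizer.Tokenizer import path comes from the PR text.

```python
def load_v1_tokenizer_class():
    """Hypothetical helper: return whisper's Tokenizer class, importing it on first use."""
    try:
        # Deferred import: this runs only when the v1 tokenizer path is taken,
        # so importing the enclosing module never requires openai-whisper.
        from whisper.tokenizer import Tokenizer
    except ImportError as exc:
        raise ImportError(
            "The CosyVoice v1 tokenizer needs openai-whisper; "
            "install it with `pip install openai-whisper`."
        ) from exc
    return Tokenizer
```

The trade-off is that a missing dependency surfaces at call time rather than at import time, which is why the sketch re-raises with an explicit installation hint.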

🤖 Generated with Claude Code

openai-whisper is a heavy (~1.5GB) speech recognition package, but CosyVoice only uses whisper.log_mel_spectrogram(), a standard audio preprocessing utility. This causes widespread installation failures (see FunAudioLLM#1844, FunAudioLLM#1266, FunAudioLLM#249, FunAudioLLM#316) due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab).

Replace all whisper.log_mel_spectrogram() calls with a lightweight implementation in cosyvoice/utils/audio_utils.py that uses only torch and torchaudio (already required dependencies). The output is numerically equivalent.

The legacy get_tokenizer() function (CosyVoice v1) still needs whisper.tokenizer.Tokenizer, so that import is moved to a lazy import inside the function body; it only triggers if you actually use the v1 tokenizer path. CosyVoice2/3 tokenizers are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

musselmanjoey wants to merge 1 commit into FunAudioLLM:main from musselmanjoey:remove-whisper-dependency
