
Remove openai-whisper dependency for log_mel_spectrogram #1846

Open
musselmanjoey wants to merge 1 commit into FunAudioLLM:main from musselmanjoey:remove-whisper-dependency


@musselmanjoey

Summary

  • Replace all whisper.log_mel_spectrogram() calls with a lightweight implementation using torch + torchaudio (already required deps)
  • Add cosyvoice/utils/audio_utils.py with a drop-in log_mel_spectrogram() function
  • Remove openai-whisper==20231117 from requirements.txt
  • Move the legacy whisper.tokenizer.Tokenizer import (used only by CosyVoice v1's get_tokenizer()) to a lazy import so it doesn't break module loading
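The lazy-import idea in the last bullet can be sketched roughly as follows. This is an illustration of the pattern only, not the code in this PR: the helper name `load_legacy_tokenizer_class` and the error message are hypothetical, and the real `get_tokenizer()` in `cosyvoice/tokenizer/tokenizer.py` does more than this.

```python
def load_legacy_tokenizer_class():
    """Hypothetical helper: import whisper's Tokenizer only when the v1 path needs it.

    Because the import lives inside the function body, importing the enclosing
    module succeeds even when openai-whisper is not installed.
    """
    try:
        # Deferred import: runs on first call, not at module load time.
        from whisper.tokenizer import Tokenizer
    except ImportError as exc:
        # Assumed wording; the PR may simply let the ImportError propagate.
        raise ImportError(
            "The CosyVoice v1 tokenizer path needs openai-whisper; "
            "install it separately to use this code path."
        ) from exc
    return Tokenizer
```

With this shape, users of the CosyVoice2/3 tokenizers never reach the `whisper` import, while v1 users get an explicit error if the package is missing.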

Motivation

openai-whisper is a ~1.5GB speech recognition package, but CosyVoice only uses one utility function from it: whisper.log_mel_spectrogram(). Requiring the whole package for that one function causes widespread installation failures due to dependency conflicts, especially on platforms with pre-installed PyTorch (Kaggle, Colab, etc.).

Related issues: #1844, #1266, #249, #316

Details

log_mel_spectrogram is a standard audio preprocessing operation (STFT → mel filterbank → log scaling). The replacement in audio_utils.py uses torch.stft and torchaudio.functional.melscale_fbanks with the same parameters as Whisper (n_fft=400, hop_length=160, 16kHz sample rate), producing numerically equivalent output.

Files changed:

  • cosyvoice/utils/audio_utils.py (new) — shared log_mel_spectrogram implementation
  • cosyvoice/cli/frontend.py — use audio_utils.log_mel_spectrogram instead of whisper
  • cosyvoice/dataset/processor.py — same replacement
  • tools/extract_speech_token.py — same replacement
  • cosyvoice/tokenizer/tokenizer.py — lazy import of whisper.tokenizer.Tokenizer (only needed for v1 tokenizer path)
  • requirements.txt — remove openai-whisper

🤖 Generated with Claude Code

openai-whisper is a heavy (~1.5GB) speech recognition package but
CosyVoice only uses whisper.log_mel_spectrogram() — a standard
audio preprocessing utility. This causes widespread installation
failures (see FunAudioLLM#1844, FunAudioLLM#1266, FunAudioLLM#249, FunAudioLLM#316) due to dependency conflicts,
especially on platforms with pre-installed PyTorch (Kaggle, Colab).

Replace all whisper.log_mel_spectrogram() calls with a lightweight
implementation in cosyvoice/utils/audio_utils.py that uses only
torch and torchaudio (already required dependencies). The output
is numerically equivalent.

The legacy get_tokenizer() function (CosyVoice v1) still needs
whisper.tokenizer.Tokenizer, so that import is moved to a lazy
import inside the function body — it only triggers if you actually
use the v1 tokenizer path. CosyVoice2/3 tokenizers are unaffected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
