A high-performance local runner for Microsoft's VibeVoice Realtime text-to-speech model. Now with OpenAI-compatible API endpoints!
Features • Quick Start • API Documentation • Credits
- Local & Private: Runs entirely on your machine (CUDA/MPS/CPU).
- Realtime Streaming: Low-latency text-to-speech generation.
- FlashSR Super-Resolution: Ultra-fast audio upsampling (24kHz → 48kHz) at 200-400x realtime, enabled by default.
- OpenAI API Compatible: Drop-in replacement for OpenAI's TTS API.
- Multiple Audio Formats: Supports Opus (default), WAV, and MP3 output.
- Web Interface: Built-in interactive demo UI.
- Multi-Platform: Optimized for Ubuntu (CUDA) and macOS (Apple Silicon).
- Easy Setup: Powered by `uv` for fast, reliable dependency management.
- uv installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Git
- Hugging Face Account (for model download)
- Bootstrap the environment: `./scripts/bootstrap_uv.sh`
- Download the model: `uv run python scripts/download_model.py`
- Run the server: `uv run python scripts/run_realtime_demo.py --port 8000`
- Web UI: Open http://127.0.0.1:8000
- API: http://127.0.0.1:8000/v1/audio/speech
This runner provides OpenAI-compatible endpoints for easy integration with existing tools and libraries.
Endpoint: POST /v1/audio/speech
Generates audio from text with FlashSR super-resolution enabled by default (24kHz → 48kHz).
curl http://127.0.0.1:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, this is VibeVoice running locally!",
"voice": "en-Carter_man",
"response_format": "opus"
}' \
--output speech.opus

| Parameter | Type | Description |
|---|---|---|
| model | string | Model identifier (e.g., tts-1). Ignored but required for compatibility. |
| input | string | The text to generate audio for. |
| voice | string | The voice ID to use (see /v1/audio/voices). |
| response_format | string | Output format: opus (default, 48kHz), wav, or mp3. |
| speed | float | Speed of generation (currently ignored). |
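
The same request can be made from Python. The following is a minimal sketch using only the standard library, assuming the server from the Quick Start is listening on 127.0.0.1:8000. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical and not part of this project.

```python
import json
import urllib.request

# Assumption: the server from the Quick Start is listening on this address.
BASE_URL = "http://127.0.0.1:8000"


def build_speech_request(text, voice="en-Carter_man", response_format="opus",
                         base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    payload = {
        "model": "tts-1",  # ignored by the runner, but required for compatibility
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
    return out_path
```

With the server running, `synthesize_to_file("Hello, this is VibeVoice running locally!", "speech.opus")` should produce the same file as the curl command. Building the request separately from sending it makes the payload easy to inspect without a live server.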
Endpoint: GET /v1/audio/voices
Returns a list of available voices.
curl http://127.0.0.1:8000/v1/audio/voices

Response:
{
"voices": [
{
"id": "en-Carter_man",
"name": "en-Carter_man",
"object": "voice",
"category": "vibe_voice",
...
},
...
]
}

The runner automatically detects the best available device:
- CUDA: NVIDIA GPUs (Linux)
- MPS: Apple Silicon (macOS)
- CPU: Fallback
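
The priority order above can be sketched in a few lines. This is an illustration only, not the runner's actual detection code: it assumes PyTorch-style availability checks, and the function names `pick_device` and `detect_device`, as well as the returned device strings other than `cpu`, are assumptions.

```python
def pick_device(cuda_available, mps_available):
    """Return the preferred device name in the order CUDA, MPS, CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device():
    """Probe PyTorch for available backends (requires torch to be installed)."""
    import torch  # imported lazily so pick_device works without PyTorch

    # Older PyTorch builds may not ship the MPS backend module at all.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(
        torch.cuda.is_available(),
        mps is not None and mps.is_available(),
    )
```

Keeping the decision in a pure function (`pick_device`) separates the ordering logic from the hardware probing, so the ordering can be checked on any machine.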
To force a specific device:
uv run python scripts/run_realtime_demo.py --device cpu

Specify the number of DDPM inference steps. Higher values (e.g., 15-20) improve quality but increase latency. The default is 15.
uv run python scripts/run_realtime_demo.py --inference-steps 15

To load the model from a custom path:
uv run python scripts/run_realtime_demo.py --model-path /path/to/model

FlashSR is enabled by default to upsample audio from 24kHz to 48kHz at 200-400x realtime speed. This provides higher quality audio output with minimal performance impact.
To disable FlashSR (output will be 24kHz):
export ENABLE_FLASHSR=false
uv run python scripts/run_realtime_demo.py

Or enable it explicitly:
export ENABLE_FLASHSR=true
uv run python scripts/run_realtime_demo.py

Benefits of FlashSR:
- Ultra-fast processing (200-400x realtime)
- Higher quality 48kHz audio output
- Lightweight model (~2MB)
- Compatible with Opus format for optimal compression
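
When starting the server from another program, the flags and the `ENABLE_FLASHSR` variable described above can be assembled in one place. The sketch below is hypothetical: `build_launch` is not part of this project, and it assumes the flags can be combined in a single invocation, which the documentation shows only individually.

```python
import os


def build_launch(port=None, device=None, inference_steps=None,
                 model_path=None, enable_flashsr=None, base_env=None):
    """Return (command, environment) for starting the server."""
    cmd = ["uv", "run", "python", "scripts/run_realtime_demo.py"]
    if port is not None:
        cmd += ["--port", str(port)]
    if device is not None:
        cmd += ["--device", device]
    if inference_steps is not None:
        cmd += ["--inference-steps", str(inference_steps)]
    if model_path is not None:
        cmd += ["--model-path", str(model_path)]

    # Copy the environment so the caller's process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    if enable_flashsr is not None:
        # Values "true"/"false" match the export examples above.
        env["ENABLE_FLASHSR"] = "true" if enable_flashsr else "false"
    return cmd, env
```

The result can be passed to `subprocess.run(cmd, env=env)` from the repository root. Leaving `enable_flashsr` as `None` keeps whatever the environment already specifies, so the server default (FlashSR enabled) applies.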
All examples were generated using 15 inference steps, with text in each voice's native language.
| Voice | Audio Example (MP3) |
|---|---|
| en-Carter_man | |
| en-Davis_man | |
| en-Emma_woman | |
| en-Frank_man | |
| en-Grace_woman | |
| en-Mike_man | |
| in-Samuel_man | |

| Language | Voice | Audio Example (MP3) |
|---|---|---|
| German | de-Spk0_man | |
| German | de-Spk1_woman | |
| Spanish | sp-Spk0_woman | |
| Spanish | sp-Spk1_man | |
| French | fr-Spk0_man | |
| French | fr-Spk1_woman | |
| Italian | it-Spk0_woman | |
| Italian | it-Spk1_man | |
| Japanese | jp-Spk0_man | |
| Japanese | jp-Spk1_woman | |
| Korean | kr-Spk0_woman | |
| Korean | kr-Spk1_man | |
| Dutch | nl-Spk0_man | |
| Dutch | nl-Spk1_woman | |
| Polish | pl-Spk0_man | |
| Polish | pl-Spk1_woman | |
| Portuguese | pt-Spk0_woman | |
| Portuguese | pt-Spk1_man | |

This project stands on the shoulders of giants. Huge thanks to:
- Microsoft: For releasing the incredible VibeVoice model and the original codebase.
- groxaxo: For the original repository and initial setup.
- Kokoro FastAPI Creators: For inspiration on the FastAPI implementation and structure.
- Open Source Community: For all the tools and libraries that make this possible.