Local runner for Microsoft's VibeVoice Realtime TTS. Fully compatible with Open WebUI (plug and play) via an OpenAI-compatible API endpoint. Run the Colab notebook experience locally with uv package management.

groxaxo/vibevoice-realtimeFASTAPI

 
 

πŸŽ™οΈ VibeVoice Realtime Runner


A high-performance local runner for Microsoft's VibeVoice Realtime text-to-speech model. Now with OpenAI-compatible API endpoints!

Features • Quick Start • API Documentation • Credits


🚀 Features

  • Local & Private: Runs entirely on your machine (CUDA/MPS/CPU).
  • Realtime Streaming: Low-latency text-to-speech generation.
  • FlashSR Super-Resolution: Ultra-fast audio upsampling (24kHz → 48kHz) at 200-400x realtime, enabled by default.
  • OpenAI API Compatible: Drop-in replacement for OpenAI's TTS API.
  • Multiple Audio Formats: Supports Opus (default), WAV, and MP3 output.
  • Web Interface: Built-in interactive demo UI.
  • Multi-Platform: Optimized for Ubuntu (CUDA) and macOS (Apple Silicon).
  • Easy Setup: Powered by uv for fast, reliable dependency management.

⚡ Quick Start

Prerequisites

  • uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
  • Git
  • Hugging Face Account (for model download)

Installation

  1. Bootstrap the environment:

    ./scripts/bootstrap_uv.sh
  2. Download the model:

    uv run python scripts/download_model.py
  3. Run the server:

    uv run python scripts/run_realtime_demo.py --port 8000

📖 API Documentation

This runner provides OpenAI-compatible endpoints for easy integration with existing tools and libraries.
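Because the endpoints mirror OpenAI's TTS API, the official `openai` Python SDK can be pointed at the local server. The snippet below is a minimal sketch, assuming the `openai` package is installed and the runner is listening on the default port 8000; the API key is a placeholder, since the local server does not check it.

```python
"""Minimal sketch: using the official `openai` SDK against the local runner."""

BASE_URL = "http://127.0.0.1:8000/v1"  # assumed default host/port

def synthesize(text: str, voice: str = "en-Carter_man", out_path: str = "speech.opus") -> str:
    # Imported here so the sketch degrades gracefully if the SDK isn't installed.
    from openai import OpenAI

    # The local server ignores the API key, but the SDK requires a non-empty value.
    client = OpenAI(base_url=BASE_URL, api_key="not-needed")
    response = client.audio.speech.create(
        model="tts-1",            # ignored by the runner, required for compatibility
        voice=voice,
        input=text,
        response_format="opus",   # default output format
    )
    with open(out_path, "wb") as f:
        f.write(response.content)  # raw audio bytes
    return out_path

if __name__ == "__main__":
    print(synthesize("Hello, this is VibeVoice running locally!"))
```

Any tool that accepts a custom OpenAI base URL (such as Open WebUI) can be configured the same way.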

πŸ—£οΈ Speech Generation

Endpoint: POST /v1/audio/speech

Generates audio from text, with FlashSR super-resolution enabled by default (24kHz → 48kHz).

curl http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, this is VibeVoice running locally!",
    "voice": "en-Carter_man",
    "response_format": "opus"
  }' \
  --output speech.opus
| Parameter | Type | Description |
| --- | --- | --- |
| model | string | Model identifier (e.g., tts-1). Ignored, but required for compatibility. |
| input | string | The text to generate audio for. |
| voice | string | The voice ID to use (see /v1/audio/voices). |
| response_format | string | Output format: opus (default, 48kHz), wav, or mp3. |
| speed | float | Speed of generation (currently ignored). |
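The same request can be issued from Python with only the standard library. This is a sketch assuming the default host and port; `build_payload` and `speak` are illustrative helper names, not part of the runner.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1/audio/speech"  # assumed default host/port

def build_payload(text, voice="en-Carter_man", response_format="opus"):
    """Build the JSON body for POST /v1/audio/speech."""
    return {
        "model": "tts-1",  # ignored by the runner, kept for OpenAI compatibility
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }

def speak(text, out_path="speech.opus"):
    """Send the request and write the returned audio bytes to disk."""
    body = json.dumps(build_payload(text)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

if __name__ == "__main__":
    speak("Hello, this is VibeVoice running locally!")
```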

🎤 List Voices

Endpoint: GET /v1/audio/voices

Returns a list of available voices.

curl http://127.0.0.1:8000/v1/audio/voices

Response:

{
  "voices": [
    {
      "id": "en-Carter_man",
      "name": "en-Carter_man",
      "object": "voice",
      "category": "vibe_voice",
      ...
    },
    ...
  ]
}
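A small helper can fetch that response and flatten it into a list of voice IDs. This is a standard-library sketch assuming the default port; `extract_voice_ids` and `list_voices` are illustrative names.

```python
import json
import urllib.request

VOICES_URL = "http://127.0.0.1:8000/v1/audio/voices"  # assumed default host/port

def extract_voice_ids(payload: dict) -> list:
    """Pull the `id` field out of each entry in a /v1/audio/voices response."""
    return [v["id"] for v in payload.get("voices", [])]

def list_voices() -> list:
    """Query the running server and return the available voice IDs."""
    with urllib.request.urlopen(VOICES_URL) as resp:
        return extract_voice_ids(json.load(resp))

if __name__ == "__main__":
    print(list_voices())
```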

βš™οΈ Configuration

Device Selection

The runner automatically detects the best available device:

  • CUDA: NVIDIA GPUs (Linux)
  • MPS: Apple Silicon (macOS)
  • CPU: Fallback

To force a specific device:

uv run python scripts/run_realtime_demo.py --device cpu
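The CUDA → MPS → CPU fallback order above can be sketched as a small helper. This is an illustration of the detection logic, not the runner's actual code; it assumes PyTorch is installed when a GPU backend might be present.

```python
def pick_device() -> str:
    """Pick the best available device: CUDA first, then MPS, then CPU fallback."""
    try:
        import torch  # only needed when a GPU backend might be present
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```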

Inference Steps

Specify the number of DDPM inference steps. Higher values improve audio quality but increase latency; the default is 15.

uv run python scripts/run_realtime_demo.py --inference-steps 15

Custom Model Path

uv run python scripts/run_realtime_demo.py --model-path /path/to/model

FlashSR Audio Super-Resolution

FlashSR is enabled by default to upsample audio from 24kHz to 48kHz at 200-400x realtime speed. This provides higher quality audio output with minimal performance impact.

To disable FlashSR (output will be 24kHz):

export ENABLE_FLASHSR=false
uv run python scripts/run_realtime_demo.py

Or enable it explicitly:

export ENABLE_FLASHSR=true
uv run python scripts/run_realtime_demo.py

Benefits of FlashSR:

  • Ultra-fast processing (200-400x realtime)
  • Higher quality 48kHz audio output
  • Lightweight model (~2MB)
  • Compatible with Opus format for optimal compression
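ENABLE_FLASHSR is a plain environment variable, so any truthy/falsy string convention applies. Below is a hedged sketch of how such a boolean flag is commonly parsed; the runner's actual parsing may differ, and `env_flag` is an illustrative helper name.

```python
import os

def env_flag(name: str, default: bool = True) -> bool:
    """Interpret an environment variable as a boolean, defaulting to True."""
    value = os.environ.get(name)
    if value is None:
        return default
    # Treat common "off" spellings as False; everything else as True.
    return value.strip().lower() not in {"0", "false", "no", "off"}

flashsr_enabled = env_flag("ENABLE_FLASHSR")  # enabled unless explicitly turned off
print(f"FlashSR enabled: {flashsr_enabled}")
```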

🎧 Demos

All examples generated using 15 inference steps with text in the voice's native language.

English

| Voice | Audio Example (MP3) |
| --- | --- |
| en-Carter_man | |
| en-Davis_man | |
| en-Emma_woman | |
| en-Frank_man | |
| en-Grace_woman | |
| en-Mike_man | |
| in-Samuel_man | |

Other Languages

| Language | Voice | Audio Example (MP3) |
| --- | --- | --- |
| German | de-Spk0_man | |
| German | de-Spk1_woman | |
| Spanish | sp-Spk0_woman | |
| Spanish | sp-Spk1_man | |
| French | fr-Spk0_man | |
| French | fr-Spk1_woman | |
| Italian | it-Spk0_woman | |
| Italian | it-Spk1_man | |
| Japanese | jp-Spk0_man | |
| Japanese | jp-Spk1_woman | |
| Korean | kr-Spk0_woman | |
| Korean | kr-Spk1_man | |
| Dutch | nl-Spk0_man | |
| Dutch | nl-Spk1_woman | |
| Polish | pl-Spk0_man | |
| Polish | pl-Spk1_woman | |
| Portuguese | pt-Spk0_woman | |
| Portuguese | pt-Spk1_man | |

πŸ† Credits & Acknowledgements

This project stands on the shoulders of giants. Huge thanks to:

  • Microsoft: For releasing the incredible VibeVoice model and the original codebase.
  • groxaxo: For the original repository and initial setup.
  • Kokoro FastAPI Creators: For inspiration on the FastAPI implementation and structure.
  • Open Source Community: For all the tools and libraries that make this possible.

Made with ❤️ for the AI Community
