LiveEdit

Towards Real-Time Diffusion-Based Streaming Video Editing

Xinyu Wang¹, Chongbo Zhao¹, Fangneng Zhan², Yue Ma²

¹THU    ²HKUST

Accepted by ECCV 2026

📣 Updates

  • [2026.06.24] Release README, inference scripts, and Hugging Face checkpoint instructions.
  • [2026.06.24] Release inference and training code.

🔍 Overview

LiveEdit is a diffusion-based framework for streaming video editing. Given a source video and a text editing instruction, LiveEdit performs causal chunk-by-chunk editing while preserving backgrounds and non-edited regions.

✨ Highlights

  • Real-time-oriented video editing with causal chunk-by-chunk inference.
  • Strong source preservation for backgrounds and non-edited regions.
  • Three-stage distillation from a bidirectional editing teacher to a streaming student.
  • AR-oriented Mask Cache for efficient region-aware computation reuse.
  • Built on Wan2.1 and the Self-Forcing codebase.

🛠 Getting Started

1. Clone the code and prepare the environment

We recommend Linux with NVIDIA GPUs. Single-GPU inference is supported; training scripts are written for multi-GPU torchrun.

conda create -n liveedit python=3.10 -y
conda activate liveedit
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

2. Download pretrained weights

Download the Wan2.1 base model:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
  --local-dir-use-symlinks False \
  --local-dir wan_models/Wan2.1-T2V-1.3B

Download the released LiveEdit checkpoint:

mkdir -p checkpoints/liveedit
huggingface-cli download cp-cp/LiveEdit ar-forcing_002000.pt \
  --local-dir checkpoints/liveedit

The released checkpoint should be organized as:

checkpoints/
└── liveedit/
    └── ar-forcing_002000.pt

wan_models/
└── Wan2.1-T2V-1.3B/

ar-forcing_002000.pt corresponds to the 2000-step self-forcing checkpoint used by infer-local-ar-forcing.sh.
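
Before running inference, it can help to confirm that the downloads landed where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the two paths are copied from the layout listed above.

```python
from pathlib import Path

# Paths copied from the layout above; adjust them if you downloaded elsewhere.
EXPECTED_PATHS = [
    Path("checkpoints/liveedit/ar-forcing_002000.pt"),
    Path("wan_models/Wan2.1-T2V-1.3B"),
]


def find_missing(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all expected weights found")
```

Run it from the repository root; an empty result means both the checkpoint file and the Wan2.1 directory are present. It only checks existence, not file integrity.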

3. Prepare input videos

For video-to-video editing, prepare a JSON file with source videos and text instructions:

[
  {
    "instruction": "Change the red currants to deep black grapes.",
    "source_path": "./test_cases/test.mp4"
  }
]

Example inputs are provided in test_cases/test.json and test_cases/test-long.json.
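
For more than a handful of videos, the JSON file can be generated rather than written by hand. This is a small sketch assuming only the two fields shown above (`instruction` and `source_path`); the helper name `write_edit_list` and the output file name `my_edits.json` are hypothetical.

```python
import json
from pathlib import Path


def write_edit_list(items, out_path):
    """Write (instruction, source_path) pairs in the JSON layout shown above.

    Hypothetical helper: it emits only the two fields that appear in the example.
    """
    entries = []
    for instruction, source_path in items:
        if not instruction.strip():
            raise ValueError(f"empty instruction for {source_path}")
        entries.append({"instruction": instruction, "source_path": str(source_path)})
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(entries, indent=2, ensure_ascii=False) + "\n",
                        encoding="utf-8")
    return entries


if __name__ == "__main__":
    write_edit_list(
        [("Change the red currants to deep black grapes.", "./test_cases/test.mp4")],
        "test_cases/my_edits.json",  # hypothetical output name
    )
```

The resulting file can then be passed to the inference script through `--data_path`.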

4. Inference

Run the default LiveEdit inference script:

bash infer-local-ar-forcing.sh

Equivalent command:

CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
  --config_path configs/wan_mm-ar-forcing-local.yaml \
  --output_folder videos/test \
  --checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
  --data_path test_cases/test.json \
  --num_output_frames 21 \
  --task v2v \
  --inference_num_steps 50
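
To run the same command over several input files, the argument list can be assembled in Python. The sketch below uses only the flags shown in the command above; the function name `build_inference_command` and the per-file output folders are assumptions, and `--num_output_frames` may need a different value for the longer example.

```python
import shlex


def build_inference_command(
    data_path,
    output_folder,
    checkpoint_path="checkpoints/liveedit/ar-forcing_002000.pt",
    config_path="configs/wan_mm-ar-forcing-local.yaml",
    num_output_frames=21,
    inference_num_steps=50,
):
    """Assemble the argv list for inference-mm.py (hypothetical wrapper)."""
    return [
        "python", "inference-mm.py",
        "--config_path", config_path,
        "--output_folder", output_folder,
        "--checkpoint_path", checkpoint_path,
        "--data_path", data_path,
        "--num_output_frames", str(num_output_frames),
        "--task", "v2v",
        "--inference_num_steps", str(inference_num_steps),
    ]


if __name__ == "__main__":
    # Print one command per example input; the output folder names are illustrative.
    for name in ("test", "test-long"):
        cmd = build_inference_command(f"test_cases/{name}.json", f"videos/{name}")
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or each list can be passed to `subprocess.run(cmd, check=True)` with `CUDA_VISIBLE_DEVICES` set in the environment.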

🚀 Efficient Inference with AR-Oriented Mask Cache

The AR-oriented Mask Cache in the paper is exposed through the token-pruning inference config and helper script. It reuses computation in unchanged regions and can optionally save mask visualizations.

bash infer-token-pruning.sh

Equivalent command:

CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
  --config_path configs/wan_mm-token-pruning.yaml \
  --output_folder videos/mask-cache-test \
  --checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
  --data_path test_cases/test.json \
  --num_output_frames 21 \
  --prefix "mask_cache_" \
  --task v2v \
  --save_mask

--save_mask saves visualizations of the reused and fully computed regions to the output folder.
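
The general idea of reusing computation in unchanged regions can be illustrated with a toy example. This is not LiveEdit's implementation: the function `update_tokens`, the per-token dictionary cache, and the boolean edit mask are all simplifying assumptions, meant only to show why a mask lets most of the work be skipped.

```python
def update_tokens(tokens, edit_mask, cache, compute):
    """Recompute tokens flagged in `edit_mask`; reuse cached outputs elsewhere.

    Toy illustration only. `cache` maps token index -> previously computed output
    and is updated in place. Returns (outputs, number_of_recomputed_tokens).
    """
    outputs = []
    recomputed = 0
    for index, (token, edited) in enumerate(zip(tokens, edit_mask)):
        if edited or index not in cache:
            cache[index] = compute(token)  # full computation for this token
            recomputed += 1
        outputs.append(cache[index])  # reused when the token is unedited and cached
    return outputs, recomputed


if __name__ == "__main__":
    cache = {}
    mask = [False, True, True, False]  # only the middle two tokens are edited
    expensive = lambda x: x * 10       # stand-in for a costly model call

    # The first pass fills the cache, so every token is computed.
    print(update_tokens([1, 2, 3, 4], mask, cache, expensive))
    # The second pass recomputes only the two masked tokens.
    print(update_tokens([1, 5, 6, 4], mask, cache, expensive))
```

In this toy, the second call performs half the computation of the first, because the unmasked positions are served from the cache.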

⚙️ Training

LiveEdit uses a three-stage training pipeline:

  1. Foundation Tuning for Editing Ability Acquisition: trains a strong offline video editing model.
  2. Teacher Forcing for Chunk-wise Causal Initialization: adapts the model to causal chunk-wise editing.
  3. DMD for Streaming Video Editing: compresses streaming inference to a small number of denoising steps.

Example entry points:

bash train-mm-bid-diffusion.sh
bash train-mm-ar-diffusion.sh
bash train-mm-ar-forcing.sh

Before training, update the config paths for your dataset, Wan2.1 model location, and stage checkpoints.
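
Once the configs are updated, the entry points can be chained so that a failed stage stops the run. The sketch below is an assumption-laden convenience, not part of the repository: `run_stages` is a hypothetical helper, and the correspondence between the scripts (in the order listed above) and stages 1 to 3 is inferred from their names.

```python
import subprocess

# Scripts in the order listed above; mapping to stages 1-3 is inferred from the names.
STAGE_SCRIPTS = [
    "train-mm-bid-diffusion.sh",
    "train-mm-ar-diffusion.sh",
    "train-mm-ar-forcing.sh",
]


def run_stages(scripts=STAGE_SCRIPTS, runner=subprocess.run):
    """Run each script with bash, in order (hypothetical helper).

    check=True makes subprocess.run raise CalledProcessError on a non-zero exit,
    so later stages never start after a failure. Returns the scripts that completed.
    """
    completed = []
    for script in scripts:
        runner(["bash", script], check=True)
        completed.append(script)
    return completed
```

Calling `run_stages()` from the repository root launches the three scripts back to back. Each later stage still needs its config to point at the checkpoint produced by the previous one, so set those paths before starting.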

Acknowledgements

This repository builds on Self-Forcing, CausVid, and Wan2.1. We thank the authors for their open-source contributions.

Citation πŸ’–

If you find this project useful for your research, please cite:

@article{wang2026liveedit,
  title={LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing},
  author={Wang, Xinyu and Zhao, Chongbo and Zhan, Fangneng and Ma, Yue},
  journal={arXiv preprint arXiv:2606.26740},
  year={2026}
}