Xinyu Wang1, Chongbo Zhao1, Fangneng Zhan2, Yue Ma2
1THU 2HKUST
Accepted by ECCV 2026
- [2026.06.24] Release README, inference scripts, and Hugging Face checkpoint instructions.
- [2026.06.24] Release inference and training code.
LiveEdit is a diffusion-based framework for streaming video editing. Given a source video and a text editing instruction, LiveEdit performs causal chunk-by-chunk editing while preserving backgrounds and non-edited regions.
- Real-time-oriented video editing with causal chunk-by-chunk inference.
- Strong source preservation for backgrounds and non-edited regions.
- Three-stage distillation from a bidirectional editing teacher to a streaming student.
- AR-oriented Mask Cache for efficient region-aware computation reuse.
- Built on Wan2.1 and the Self-Forcing codebase.
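To make "causal chunk-by-chunk" concrete, here is a minimal sketch of the control flow only. It is not the LiveEdit implementation: `iter_chunks`, `edit_stream`, the `edit_chunk` callback, and the default `chunk_size` are hypothetical names and values chosen for illustration. The point it shows is that each chunk is edited using only chunks that have already been produced, never future ones.

```python
def iter_chunks(num_frames, chunk_size):
    """Yield (start, end) frame index ranges in temporal order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_frames, chunk_size):
        yield start, min(start + chunk_size, num_frames)


def edit_stream(frames, instruction, edit_chunk, chunk_size=3):
    """Edit frames chunk by chunk with causal context.

    edit_chunk is a hypothetical callback standing in for the model:
    it receives the current chunk, the instruction, and the already
    edited past frames, and returns the edited frames for this chunk.
    """
    edited = []
    for start, end in iter_chunks(len(frames), chunk_size):
        past = tuple(edited)  # read-only view of the past, no future frames
        edited.extend(edit_chunk(frames[start:end], instruction, past))
    return edited
```

Because each step depends only on the past, the first edited chunk can be emitted before the rest of the source video has been processed.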
We recommend Linux with NVIDIA GPUs. Single-GPU inference is supported; training scripts are written for multi-GPU torchrun.
conda create -n liveedit python=3.10 -y
conda activate liveedit
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Download the Wan2.1 base model:
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
--local-dir-use-symlinks False \
--local-dir wan_models/Wan2.1-T2V-1.3B
Download the released LiveEdit checkpoint:
mkdir -p checkpoints/liveedit
huggingface-cli download cp-cp/LiveEdit ar-forcing_002000.pt \
--local-dir checkpoints/liveedit
The released checkpoint should be organized as:
checkpoints/
└── liveedit/
    └── ar-forcing_002000.pt
wan_models/
└── Wan2.1-T2V-1.3B/
ar-forcing_002000.pt corresponds to the 2000-step self-forcing checkpoint used by infer-local-ar-forcing.sh.
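Before running inference, it can help to confirm that both downloads landed where the scripts expect them. The following is a small optional sketch, not part of the repository: `find_missing` and `EXPECTED_PATHS` are hypothetical names, and the paths are copied from the layout described in this section.

```python
from pathlib import Path

# Paths copied from the layout in this README; adjust them if you
# downloaded the files to a different location.
EXPECTED_PATHS = [
    Path("checkpoints/liveedit/ar-forcing_002000.pt"),
    Path("wan_models/Wan2.1-T2V-1.3B"),
]


def find_missing(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    missing = find_missing()
    for path in missing:
        print(f"missing: {path}")
    if not missing:
        print("checkpoint layout looks complete")
```

Run it from the repository root. It only checks that the paths exist, not that the files are complete.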
For video-to-video editing, prepare a JSON file with source videos and text instructions:
[
{
"instruction": "Change the red currants to deep black grapes.",
"source_path": "./test_cases/test.mp4"
}
]
Example inputs are provided in test_cases/test.json and test_cases/test-long.json.
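If you have many clips, the input file can be generated rather than written by hand. The sketch below assumes only the two keys shown in the example (`instruction` and `source_path`). The helper names `build_edit_list` and `write_edit_list` are hypothetical and are not part of the repository.

```python
import json
from pathlib import Path


def build_edit_list(items, check_exists=True):
    """Turn (source_path, instruction) pairs into the list-of-dicts layout."""
    entries = []
    for source_path, instruction in items:
        if not instruction.strip():
            raise ValueError(f"empty instruction for {source_path}")
        if check_exists and not Path(source_path).is_file():
            raise FileNotFoundError(source_path)
        entries.append(
            {"instruction": instruction, "source_path": str(source_path)}
        )
    return entries


def write_edit_list(entries, out_path):
    """Write the entries as indented JSON."""
    Path(out_path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

For example, `write_edit_list(build_edit_list([("./test_cases/test.mp4", "Change the red currants to deep black grapes.")]), "my_cases.json")` produces a file in the same shape as test_cases/test.json, which can then be passed via --data_path.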
Run the default LiveEdit inference script:
bash infer-local-ar-forcing.sh
Equivalent command:
CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
--config_path configs/wan_mm-ar-forcing-local.yaml \
--output_folder videos/test \
--checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
--data_path test_cases/test.json \
--num_output_frames 21 \
--task v2v \
--inference_num_steps 50
The AR-oriented Mask Cache in the paper is exposed through the token-pruning inference config and helper script. It reuses computation in unchanged regions and can optionally save mask visualizations.
bash infer-token-pruning.sh
Equivalent command:
CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
--config_path configs/wan_mm-token-pruning.yaml \
--output_folder videos/mask-cache-test \
--checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
--data_path test_cases/test.json \
--num_output_frames 21 \
--prefix "mask_cache_" \
--task v2v \
--save_mask
The --save_mask flag saves visualizations of the reused and fully computed regions to the output folder.
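The general idea of reusing computation in unchanged regions can be illustrated with a toy example. This is a deliberately simplified sketch and not the Mask Cache from the paper: `cached_update` is a hypothetical function, and it treats tokens as independent, which real attention layers do not. It shows only the bookkeeping: tokens inside the edit mask are recomputed, and all others take their cached output.

```python
def cached_update(tokens, cache, edit_mask, compute):
    """Recompute only masked tokens; reuse cached outputs elsewhere.

    tokens:    per-token inputs for the current pass
    cache:     outputs saved from an earlier pass (same length)
    edit_mask: bools, True where the token lies in an edited region
    compute:   the expensive per-token function (stand-in for a model block)
    """
    if not (len(tokens) == len(cache) == len(edit_mask)):
        raise ValueError("tokens, cache and edit_mask must have equal length")
    outputs = []
    recomputed = 0
    for token, cached, edited in zip(tokens, cache, edit_mask):
        if edited:
            outputs.append(compute(token))  # edited region: full compute
            recomputed += 1
        else:
            outputs.append(cached)  # unchanged region: reuse
    return outputs, recomputed
```

In this toy setting the saving is proportional to the share of tokens outside the mask. For the actual behavior, see the token-pruning config and the masks written by --save_mask.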
LiveEdit uses a three-stage training pipeline:
- Foundation Tuning for Editing Ability Acquisition: trains a strong offline video editing model.
- Teacher Forcing for Chunk-wise Causal Initialization: adapts the model to causal chunk-wise editing.
- DMD for Streaming Video Editing: compresses streaming inference to a small number of denoising steps.
Example entry points:
bash train-mm-bid-diffusion.sh
bash train-mm-ar-diffusion.sh
bash train-mm-ar-forcing.sh
Before training, update the config paths for your dataset, Wan2.1 model location, and stage checkpoints.
This repository builds on Self-Forcing, CausVid, and Wan2.1. We thank the authors for their open-source contributions.
If you find this project useful for your research, please cite:
@article{wang2026liveedit,
title={LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing},
author={Wang, Xinyu and Zhao, Chongbo and Zhan, Fangneng and Ma, Yue},
journal={arXiv preprint arXiv:2606.26740},
year={2026}
}