LiveEdit

Towards Real-Time Diffusion-Based Streaming Video Editing

Xinyu Wang¹, Chongbo Zhao¹, Fangneng Zhan², Yue Ma²

¹THU ²HKUST

Accepted by ECCV 2026

📣 Updates

[2026.06.24] Release README, inference scripts, and Hugging Face checkpoint instructions.
[2026.06.24] Release inference and training code.

🔍 Overview

LiveEdit is a diffusion-based framework for streaming video editing. Given a source video and a text editing instruction, LiveEdit performs causal chunk-by-chunk editing while preserving backgrounds and non-edited regions.
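
To make "causal chunk-by-chunk editing" concrete, here is a minimal sketch of the control flow. It is an illustration only: `stream_edit`, `toy_edit_chunk`, the chunk size of 3, and the string frames are assumptions made for this example, not names or settings taken from the LiveEdit code.

```python
from typing import Callable, List, Sequence


def stream_edit(
    frames: Sequence[str],
    instruction: str,
    edit_chunk: Callable[[Sequence[str], Sequence[str], str], List[str]],
    chunk_size: int = 3,  # assumed value, for illustration only
) -> List[str]:
    """Edit frames chunk by chunk; each chunk sees only earlier output."""
    edited: List[str] = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Causal constraint: the context holds past edited frames only,
        # never frames from the current or any future chunk.
        context = list(edited)
        edited.extend(edit_chunk(chunk, context, instruction))
    return edited


def toy_edit_chunk(chunk, context, instruction):
    """Hypothetical stand-in for the model: tags each frame."""
    return [f"{frame}|edited(ctx={len(context)})" for frame in chunk]


if __name__ == "__main__":
    source = [f"f{i}" for i in range(7)]
    for frame in stream_edit(source, "make it snowy", toy_edit_chunk):
        print(frame)
```

Because each chunk depends only on what has already been produced, output frames can be emitted as soon as their chunk finishes, without waiting for the rest of the video.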

✨ Highlights

Real-time-oriented video editing with causal chunk-by-chunk inference.
Strong source preservation for backgrounds and non-edited regions.
Three-stage distillation from a bidirectional editing teacher to a streaming student.
AR-oriented Mask Cache for efficient region-aware computation reuse.
Built on Wan2.1 and the Self-Forcing codebase.

🛠 Getting Started

1. Clone the code and prepare the environment

We recommend Linux with NVIDIA GPUs. Single-GPU inference is supported; training scripts are written for multi-GPU torchrun.

conda create -n liveedit python=3.10 -y
conda activate liveedit
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

2. Download pretrained weights

Download the Wan2.1 base model:

huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B \
  --local-dir-use-symlinks False \
  --local-dir wan_models/Wan2.1-T2V-1.3B

Download the released LiveEdit checkpoint:

mkdir -p checkpoints/liveedit
huggingface-cli download cp-cp/LiveEdit ar-forcing_002000.pt \
  --local-dir checkpoints/liveedit

The released checkpoint should be organized as:

checkpoints/
└── liveedit/
    └── ar-forcing_002000.pt

wan_models/
└── Wan2.1-T2V-1.3B/

ar-forcing_002000.pt corresponds to the 2000-step self-forcing checkpoint used by infer-local-ar-forcing.sh.
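
Before running inference, it can help to confirm that the layout above is in place. The following is a small sketch of such a check; `missing_weights` is a hypothetical helper written for this README and is not part of the repository.

```python
from pathlib import Path
from typing import List


def missing_weights(root: Path) -> List[str]:
    """Return the expected weight paths that are absent under `root`."""
    expected = [
        root / "checkpoints" / "liveedit" / "ar-forcing_002000.pt",
        root / "wan_models" / "Wan2.1-T2V-1.3B",
    ]
    return [str(p.relative_to(root)) for p in expected if not p.exists()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_weights(Path("."))
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected weight paths are present.")
```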

3. Prepare input videos

For video-to-video editing, prepare a JSON file with source videos and text instructions:

[
  {
    "instruction": "Change the red currants to deep black grapes.",
    "source_path": "./test_cases/test.mp4"
  }
]

Example inputs are provided in test_cases/test.json and test_cases/test-long.json.
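
For more than a handful of clips, the JSON file can be generated rather than written by hand. The sketch below uses only the two keys shown above (`instruction` and `source_path`); the helper names `build_cases` and `write_cases` are assumptions introduced for this example.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def build_cases(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (source_path, instruction) pairs into input entries."""
    cases = []
    for source_path, instruction in pairs:
        if not instruction.strip():
            raise ValueError(f"Empty instruction for {source_path}")
        cases.append({"instruction": instruction, "source_path": source_path})
    return cases


def write_cases(cases: List[Dict[str, str]], out_path: Path) -> None:
    """Write the entries as a JSON array, creating parent folders."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(cases, indent=2, ensure_ascii=False)
    out_path.write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    cases = build_cases([
        ("./test_cases/test.mp4",
         "Change the red currants to deep black grapes."),
    ])
    # Print instead of writing, so running this has no side effects.
    print(json.dumps(cases, indent=2, ensure_ascii=False))
```

Call `write_cases(cases, Path(...))` with a path of your choice, then pass that path to `--data_path`.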

4. Inference

Run the default LiveEdit inference script:

bash infer-local-ar-forcing.sh

Equivalent command:

CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
  --config_path configs/wan_mm-ar-forcing-local.yaml \
  --output_folder videos/test \
  --checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
  --data_path test_cases/test.json \
  --num_output_frames 21 \
  --task v2v \
  --inference_num_steps 50
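
To run the same command over several input files, a thin wrapper can assemble the arguments. This is a sketch under assumptions: `build_inference_command` and `run_all` are hypothetical helpers, the `videos/<json stem>` output naming is a choice made here, and the flag values are copied unchanged from the command above (longer clips may need a different frame count).

```python
import os
import shlex
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_inference_command(data_path: str) -> List[str]:
    """Assemble the inference command for one input JSON file."""
    output_folder = f"videos/{Path(data_path).stem}"  # assumed naming scheme
    return [
        "python", "inference-mm.py",
        "--config_path", "configs/wan_mm-ar-forcing-local.yaml",
        "--output_folder", output_folder,
        "--checkpoint_path", "checkpoints/liveedit/ar-forcing_002000.pt",
        "--data_path", data_path,
        "--num_output_frames", "21",
        "--task", "v2v",
        "--inference_num_steps", "50",
    ]


def run_all(data_paths: Sequence[str], gpu: str = "0",
            dry_run: bool = True) -> List[List[str]]:
    """Print each command; execute it only when dry_run is False."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    commands = [build_inference_command(p) for p in data_paths]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True, env=env)
    return commands


if __name__ == "__main__":
    # Dry run by default: only prints the commands.
    run_all(["test_cases/test.json", "test_cases/test-long.json"])
```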

🚀 Efficient Inference with AR-Oriented Mask Cache

The AR-oriented Mask Cache in the paper is exposed through the token-pruning inference config and helper script. It reuses computation in unchanged regions and can optionally save mask visualizations.

bash infer-token-pruning.sh

Equivalent command:

CUDA_VISIBLE_DEVICES=0 python inference-mm.py \
  --config_path configs/wan_mm-token-pruning.yaml \
  --output_folder videos/mask-cache-test \
  --checkpoint_path checkpoints/liveedit/ar-forcing_002000.pt \
  --data_path test_cases/test.json \
  --num_output_frames 21 \
  --prefix "mask_cache_" \
  --task v2v \
  --save_mask

--save_mask saves visualizations of the reused and fully computed regions to the output folder.
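
The idea of reusing computation in unchanged regions can be illustrated with a small, framework-free sketch. This is a conceptual toy and not the repository's Mask Cache: `reuse_with_mask`, the boolean per-token mask, and the `compute` callback are hypothetical names chosen for this example.

```python
from typing import Callable, List, Sequence, Tuple


def reuse_with_mask(
    tokens: Sequence[float],
    cached: Sequence[float],
    edit_mask: Sequence[bool],
    compute: Callable[[float], float],
) -> Tuple[List[float], int]:
    """Recompute only tokens marked as edited; reuse the cache elsewhere.

    Returns the merged outputs and the number of recomputed tokens.
    """
    if not (len(tokens) == len(cached) == len(edit_mask)):
        raise ValueError("tokens, cached, and edit_mask must have equal length")
    outputs: List[float] = []
    recomputed = 0
    for token, old, is_edited in zip(tokens, cached, edit_mask):
        if is_edited:
            outputs.append(compute(token))  # full computation
            recomputed += 1
        else:
            outputs.append(old)  # reused from the cache
    return outputs, recomputed


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0, 4.0]
    cached = [10.0, 20.0, 30.0, 40.0]
    mask = [False, True, True, False]
    outputs, n = reuse_with_mask(tokens, cached, mask, lambda t: t * 100.0)
    print(outputs, f"recomputed {n} of {len(tokens)} tokens")
```

In this toy, the saving grows with the share of tokens outside the edited region, which is the intuition behind region-aware reuse.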

⚙️ Training

LiveEdit uses a three-stage training pipeline:

Foundation Tuning for Editing Ability Acquisition: trains a strong offline video editing model.
Teacher Forcing for Chunk-wise Causal Initialization: adapts the model to causal chunk-wise editing.
DMD for Streaming Video Editing: compresses streaming inference to a small number of denoising steps.

Example entry points:

bash train-mm-bid-diffusion.sh
bash train-mm-ar-diffusion.sh
bash train-mm-ar-forcing.sh

Before training, update the config paths for your dataset, Wan2.1 model location, and stage checkpoints.
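
Since each stage needs its config updated with the previous stage's checkpoint, it is safer to launch stages one at a time. The sketch below assumes the entry points above are listed in stage order; that mapping, along with the `STAGES` table and the `run_stage` helper, is an assumption of this example rather than something the repository states.

```python
import subprocess
from typing import List, Tuple

# Assumption: the entry points are listed in the same order as the stages.
STAGES: List[Tuple[str, str]] = [
    ("Foundation Tuning", "train-mm-bid-diffusion.sh"),
    ("Teacher Forcing", "train-mm-ar-diffusion.sh"),
    ("DMD", "train-mm-ar-forcing.sh"),
]


def run_stage(index: int, dry_run: bool = True) -> List[str]:
    """Print the command for one stage; execute it when dry_run is False."""
    name, script = STAGES[index]
    cmd = ["bash", script]
    print(f"Stage {index + 1} ({name}): {' '.join(cmd)}")
    if not dry_run:
        # check=True stops on failure so a broken stage is not skipped over.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: show the plan without starting any training.
    for i in range(len(STAGES)):
        run_stage(i)
```

Run a single stage with `run_stage(0, dry_run=False)`, update the next stage's config, and continue.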

👍 Acknowledgements

This repository builds on Self-Forcing, CausVid, and Wan2.1. We thank the authors for their open-source contributions.

Citation 💖

If you find this project useful for your research, please cite:

@article{wang2026liveedit,
  title={LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing},
  author={Wang, Xinyu and Zhao, Chongbo and Zhan, Fangneng and Ma, Yue},
  journal={arXiv preprint arXiv:2606.26740},
  year={2026}
}
