What if an image model could turn multiple images into a coherent, dynamic visual universe? 🤯 iMontage brings video-like motion priors to image generation, enabling rich transitions and consistent multi-image outputs, all from your own inputs. Try it out below and explore your imagination!
- ⚡ High-dynamic, high-consistency image generation from flexible inputs
- 🎛️ Robust instruction following across heterogeneous tasks
- 🌀 Video-like temporal coherence, even for non-video image sets
- 🏆 State-of-the-art results across a broad range of image generation tasks
- 2025.11.26 – The arXiv paper of iMontage is released.
- 2025.11.26 – Inference code and model weights of iMontage are released.
conda create -n iMontage python=3.10
conda activate iMontage
# NOTE Choose torch version compatible with your CUDA
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126
# Install Flash Attention 2
# NOTE Choose a flash-attn version compatible with the installed torch
pip install "flash-attn==2.7.4.post1" --no-build-isolation
(Note) We train and evaluate our model with FlashAttention-3, so inference quality may be suboptimal with flash-attn 2.
If you are working on NVIDIA H100/H800 GPUs and want the best performance from our model, follow the official FlashAttention-3 installation guide and replace the corresponding code in fastvideo/models/flash_attn_no_pad.py.
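If you do switch to FlashAttention-3, the change in `fastvideo/models/flash_attn_no_pad.py` is essentially a backend swap. The sketch below is only an assumption about how that swap might look: `flash_attn_interface` is the module name shipped by the FlashAttention-3 (hopper) build, and its function signatures differ slightly from flash-attn 2 (for example, no dropout argument), so adapt the call site to your installed version.

```python
# Hypothetical sketch of the backend swap in fastvideo/models/flash_attn_no_pad.py.
# flash_attn_interface is provided by the FlashAttention-3 (hopper) build; its
# varlen API differs slightly from flash-attn 2, so the call site in the file
# must be adjusted accordingly -- check your installed version.
try:
    from flash_attn_interface import flash_attn_varlen_func  # FlashAttention-3
    HAS_FA3 = True
except ImportError:
    from flash_attn import flash_attn_varlen_func  # fall back to FlashAttention-2
    HAS_FA3 = False
```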
After installing torch and Flash Attention, install the remaining dependencies with:
pip install -e .

mkdir ckpts/hyvideo_ckpts
# Downloading hunyuan-video-i2v-720p; this may take 10 minutes to 1 hour depending on network conditions.
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./ckpts/hyvideo_ckpts
# Downloading text_encoder from HunyuanVideo-T2V
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
python fastvideo/models/hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/hyvideo_ckpts/text_encoder
# Downloading text_encoder_2 from HunyuanVideo-I2V
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/hyvideo_ckpts/text_encoder_2
mkdir ckpts/iMontage_ckpts
# Downloading iMontage DiT weights; this may also take some time.
huggingface-cli download Kr1sJ/iMontage --local-dir ./ckpts/iMontage_ckpts

The final checkpoint file structure should look like this:
iMontage
├──ckpts
│ ├──hyvideo_ckpts
│ │ ├──hunyuan-video-i2v-720p
│ │ │ ├──transformers
│ │ │ │ ├──mp_rank_00_model_states.pt
│ │ │ ├──vae
│ │ ├──text_encoder_i2v
│ │ ├──text_encoder_2
│ ├──iMontage_ckpts
│ │ ├──diffusion_pytorch_model.safetensors
│ ...
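Before running inference, you can optionally verify the layout with a small helper (illustrative only; the paths mirror the tree above, so adjust them if you downloaded the weights elsewhere):

```python
# Check that the expected checkpoint files/directories from the tree above exist.
from pathlib import Path

REQUIRED = [
    "ckpts/hyvideo_ckpts/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt",
    "ckpts/hyvideo_ckpts/hunyuan-video-i2v-720p/vae",
    "ckpts/hyvideo_ckpts/text_encoder_i2v",
    "ckpts/hyvideo_ckpts/text_encoder_2",
    "ckpts/iMontage_ckpts/diffusion_pytorch_model.safetensors",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    print("Missing checkpoint paths:")
    for p in missing:
        print("  -", p)
else:
    print("All expected checkpoint paths found.")
```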
After installing the environment and downloading the pretrained weights, let's start with our inference example. Note that our model currently supports at most 4 input images and at most 4 output images.
Run the following command:
bash scripts/inference.sh

In this example, we run inference with:
--prompt assets/prompt.json

The JSON file contains six representative tasks:
- Image editing
- Character reference generation (CRef)
- CRef + vision signal
- Style reference generation (SRef)
- Multi-view generation
- Storyboard generation
Each entry specifies the task type, instruction prompt, input reference images, output resolution, and desired number of generated frames. Running the script will automatically process all tasks in the JSON and save the results under the output directory.
The expected results should be:
To run inference on your own images, create a JSON file containing entries like this:
"0" :
{
"task_type": "image_editing",
"prompts" : "Change the material of the lava to silver.",
"images" : [
"assets/images/llava.png"
],
"height" : 416,
"width" : 640,
"output_num" : 1
}
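The same entry can also be assembled programmatically. Below is a minimal sketch assuming the schema shown above (`task_type`, `prompts`, `images`, `height`, `width`, `output_num`); the output filename `my_prompts.json` and the way `scripts/inference.sh` picks up the prompt file are assumptions, so check the script before relying on it:

```python
# Build a prompt JSON following the schema shown above, then launch the provided
# inference script. The field names mirror the example entry; how inference.sh
# reads --prompt is an assumption -- inspect scripts/inference.sh and adjust.
import json
import subprocess

tasks = {
    "0": {
        "task_type": "image_editing",
        "prompts": "Change the material of the lava to silver.",
        "images": ["assets/images/llava.png"],
        "height": 416,
        "width": 640,
        "output_num": 1,
    },
}

with open("my_prompts.json", "w") as f:
    json.dump(tasks, f, indent=2)

# Point the script at the new prompt file (edit scripts/inference.sh accordingly).
subprocess.run(["bash", "scripts/inference.sh"], check=True)
```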
The instructions for each task type are summarized below:
| Task Type | Description | Inputs | Notes / Tips |
|---|---|---|---|
| image_editing | Edit the input image according to the instruction (material, style, object change, etc.). | 1 image | The prompt should clearly describe what to change. Keep the output size close to the input image size. |
| cref | Generate an output using multiple character reference images. | ≥ 1 image | The order of reference images matters; the prompt should specify who comes from which image. Best results with 2–4 reference images. |
| conditioned_cref | Generate an output using multiple images plus a vision-signal control map (depth, canny, openpose). | ≥ 1 image | Only depth, canny, and openpose are supported; the prompt should be one of these three words. Place the control map as the first image. |
| sref | Apply the style/features of the reference images to generate a new image. | 2 images | Leave the prompt empty if only transferring style; the model will infer the style from the input images. Place the style reference image second. |
| multiview | Generate multiple viewpoints of the same scene. | 1 image | The prompt should contain step-by-step view changes (e.g., “move left”, “look up”, “zoom out”). output_num must match the number of described views. NOTE: results can be unsatisfying; try different prompts and seeds. |
| storyboard | Generate a sequence of frames forming a short story based on the references. | ≥ 1 image | Prompts should be enumerated (1, 2, 3…) and start with a story style word (vintage film, Japanese anime, etc.). Use reference images to anchor characters or props. A wider output resolution often works better for a cinematic look. |
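To make the table concrete, here are two hypothetical entries (the image paths and prompt texts are placeholders) illustrating the reference-image ordering for `cref` and the enumerated, style-prefixed prompts for `storyboard`:

```python
# Illustrative entries only -- image paths and prompt texts are placeholders,
# following the schema and tips described above.
import json

example_entries = {
    "1": {  # cref: order matters; the prompt says who comes from which image
        "task_type": "cref",
        "prompts": "The woman from the first image shakes hands with the man from the second image in a park.",
        "images": ["assets/images/woman.png", "assets/images/man.png"],
        "height": 416,
        "width": 640,
        "output_num": 1,
    },
    "2": {  # storyboard: prompts enumerated 1, 2, 3... and prefixed with a style word
        "task_type": "storyboard",
        "prompts": "Japanese anime. 1. The hero leaves the village at dawn. 2. He crosses a stormy bridge. 3. He reaches the castle gate at night.",
        "images": ["assets/images/hero.png"],
        "height": 416,
        "width": 768,
        "output_num": 3,
    },
}

with open("my_prompts.json", "w") as f:
    json.dump(example_entries, f, indent=2)
```

Append such entries to your prompt JSON and run the inference script as in the example above.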
We sincerely thank the open-source community for providing strong foundations that enabled this work.
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:
- HunyuanVideo-T2V, HunyuanVideo-I2V – Provided base generative model designs and code.
- FastVideo – Contributed key components and open-source utilities that supported our development.
These contributions have greatly influenced our research and helped shape the design of iMontage.
If you find iMontage useful for your research or applications, please consider starring ⭐ the repo and citing our paper:
@article{fu2025iMontage,
title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation},
author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
journal={arXiv preprint arXiv:2511.20635},
year={2025},
}