iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

arXiv | Project Page | Demo | HuggingFace

iMontage Teaser

What if an image model could turn multiple images into a coherent, dynamic visual universe? 🤯 iMontage brings video-like motion priors to image generation, enabling rich transitions and consistent multi-image outputs, all from your own inputs. Try it out below and explore your imagination!

📦 Features

  • ⚡ Highly dynamic, highly consistent image generation from flexible inputs
  • 🎛️ Robust instruction following across heterogeneous tasks
  • 🌀 Video-like temporal coherence, even for non-video image sets
  • 🏆 State-of-the-art results across a broad range of tasks

📰 News

  • 2025.11.26 – The arXiv paper of iMontage is released.
  • 2025.11.26 – The inference code and model weights of iMontage are released.

🛠 Installation

1. Create virtual environment

conda create -n iMontage python=3.10
conda activate iMontage

# NOTE Choose torch version compatible with your CUDA
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126

# Install Flash Attention 2
# NOTE Also choose the correct version compatible with installed torch
pip install "flash-attn==2.7.4.post1" --no-build-isolation

Note: we train and evaluate our model with FlashAttention-3, so inference quality may be suboptimal with flash-attn 2.

If you are working on NVIDIA H100/H800 GPUs and want the best performance from our model, follow the official FlashAttention-3 guidance here, and then replace the corresponding code in fastvideo/models/flash_attn_no_pad.py.

After installing torch and flash-attn, install all other dependencies with:

pip install -e .
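To confirm the environment is set up correctly before downloading weights, you can run a quick check like the one below. This is a minimal sketch of our own, not part of the repository:

# Quick environment sanity check (hypothetical helper, not shipped with the repo)
import torch

print("torch:", torch.__version__)                # expect 2.6.0+cu126
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expect 2.7.4.post1 (or FA3 if installed)
except ImportError:
    print("flash-attn is not installed")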

2. Download model weights

mkdir -p ckpts/hyvideo_ckpts

# Download hunyuan-video-i2v-720p; this may take 10 minutes to 1 hour depending on network conditions.
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./ckpts/hyvideo_ckpts

# Downloading text_encoder from HunyuanVideo-T2V
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
python fastvideo/models/hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/hyvideo_ckpts/text_encoder

# Downloading text_encoder_2 from HunyuanVideo-I2V
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/hyvideo_ckpts/text_encoder_2

mkdir ckpts/iMontage_ckpts
# Download the iMontage DiT weights; this may also take some time.
huggingface-cli download Kr1sJ/iMontage --local-dir ./ckpts/iMontage_ckpts
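If you prefer to script the downloads, the same repositories can be fetched from Python with huggingface_hub. This is a sketch equivalent to the commands above (assuming huggingface_hub is installed); the text_encoder preprocessing step is still required:

# Python alternative to huggingface-cli for the downloads above (sketch; same target paths)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tencent/HunyuanVideo-I2V", local_dir="./ckpts/hyvideo_ckpts")
snapshot_download(repo_id="xtuner/llava-llama-3-8b-v1_1-transformers",
                  local_dir="./ckpts/llava-llama-3-8b-v1_1-transformers")
snapshot_download(repo_id="openai/clip-vit-large-patch14",
                  local_dir="./ckpts/hyvideo_ckpts/text_encoder_2")
snapshot_download(repo_id="Kr1sJ/iMontage", local_dir="./ckpts/iMontage_ckpts")
# Still run preprocess_text_encoder_tokenizer_utils.py afterwards to build text_encoder.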

The final checkpoint directory structure should look like this:

iMontage
  ├──ckpts
  │  ├──hyvideo_ckpts
  │  │  ├──hunyuan-video-i2v-720p
  │  │  │  ├──transformers
  │  │  │  │  ├──mp_rank_00_model_states.pt
  │  │  │  ├──vae
  │  │  ├──text_encoder_i2v
  │  │  ├──text_encoder_2
  │  ├──iMontage_ckpts
  │  │  ├──diffusion_pytorch_model.safetensors
  │ ...
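A quick way to verify the layout is to check that the key files from the tree above exist. The snippet below is our own sketch; adjust the paths if your layout differs:

# Hypothetical check that the main checkpoint files are in place
from pathlib import Path

expected = [
    "ckpts/hyvideo_ckpts/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt",
    "ckpts/iMontage_ckpts/diffusion_pytorch_model.safetensors",
]
for rel in expected:
    print(("ok      " if Path(rel).is_file() else "MISSING ") + rel)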

🚀 Inference

After setting up the environment and downloading the pretrained weights, you can start with our inference example. Note that the model currently supports at most 4 input images and at most 4 output images.

🔹 Example

Run the following command:

bash scripts/inference.sh

In this example, we run inference with:

--prompt assets/prompt.json

The JSON file contains six representative tasks:

  • Image editing
  • Character reference generation (CRef)
  • CRef + Vision signal
  • Style reference generation (SRef)
  • Multi-view generation
  • Storyboard generation

Each entry specifies the task type, instruction prompt, input reference images, output resolution, and desired number of generated frames. Running the script will automatically process all tasks in the JSON and save the results under the output directory.
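If you want to inspect or script over these tasks, the prompt file can be read directly. The loop below is a sketch of our own, based on the entry fields shown in the "Run your own job" subsection (task_type, prompts, images, height, width, output_num):

# Sketch: list the tasks defined in assets/prompt.json
import json

with open("assets/prompt.json") as f:
    tasks = json.load(f)

for key, entry in tasks.items():
    print(f"[{key}] {entry['task_type']}: "
          f"{len(entry['images'])} input(s) -> {entry['output_num']} output(s), "
          f"{entry['width']}x{entry['height']}")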

The expected results should be:

| Task Type | Input | Prompt | Output |
| --- | --- | --- | --- |
| image_editing | (image) | Change the material of the lava to silver. | (image) |
| cref | (image) | Confucius from the first image, Moses from the second… | (image) |
| conditioned_cref | (image) | depth | (image) |
| sref | (image) | (empty) | (image) |
| multiview | (image) | 1. Shift left; 2. Look up; 3. Zoom out. | (image) |
| storyboard | (image) | Vintage film: 1. Hepburn carrying the yellow bag… | (image) |

🔹 Run your own job

To run inference on your own images, create a JSON file in the same format as assets/prompt.json, with entries like this:

"0" :
    {
        "task_type": "image_editing",
        "prompts" : "Change the material of the lava to silver.",
        "images" : [
            "assets/images/llava.png"
        ],
        "height" : 416,
        "width" : 640,
        "output_num" : 1
    }
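Equivalently, you can write the file programmatically. The snippet below is a minimal sketch of our own (my_prompt.json is a hypothetical path) that produces one image_editing entry and can be passed to the inference script via --prompt:

# Sketch: build a single-entry prompt JSON for your own job (my_prompt.json is a placeholder name)
import json

entry = {
    "task_type": "image_editing",
    "prompts": "Change the material of the lava to silver.",
    "images": ["assets/images/llava.png"],
    "height": 416,
    "width": 640,
    "output_num": 1,
}

with open("my_prompt.json", "w") as f:
    json.dump({"0": entry}, f, indent=4)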

The instructions for all tasks are summarized below:

| Task Type | Description | Inputs | Notes / Tips |
| --- | --- | --- | --- |
| image_editing | Edit the input image according to the instruction (material, style, object change, etc.). | 1 image | The prompt should clearly describe what to change. Best to match the output size to the input image size. |
| cref | Generate an output using multiple character reference images. | ≥ 1 image | The order of reference images matters; the prompt should specify who comes from which image. Best results with 2–4 reference images. |
| conditioned_cref | Generate an output using multiple images and a vision-signal control map (depth, canny, openpose). | ≥ 1 image | Only depth, canny, and openpose are supported; the prompt should be exactly one of these three words. Put the control map first in the image list. |
| sref | Apply the style/features of the reference images to generate a new image. | 2 images | Leave the prompt empty if only using style; the model will infer the style from the input images. Put the style reference image in the second place. |
| multiview | Generate multiple viewpoints of the same scene. | 1 image | The prompt should contain step-by-step view changes (e.g., “move left”, “look up”, “zoom out”). output_num must match the number of described views. NOTE: results can be unsatisfying; try different prompts and seeds. |
| storyboard | Generate a sequence of frames forming a short story based on references. | ≥ 1 image | Prompts should be enumerated (1, 2, 3…) and start with a story-style word (vintage film, Japanese anime, etc.). Use reference images to anchor characters or props. Output resolution is often wider for a cinematic style. |
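Since each task has its own constraints, a small pre-flight check can save failed runs. The validator below is a sketch of our own (not part of the repository) that encodes a few of the rules from the table above, namely the 4-input/4-output limit, the control-map keywords, and the sref image count:

# Sketch: sanity-check prompt entries against the per-task rules above (not part of the repo)
import json

CONTROL_PROMPTS = {"depth", "canny", "openpose"}

def check_entry(entry):
    issues = []
    n_in, n_out = len(entry["images"]), entry["output_num"]
    if n_in > 4 or n_out > 4:
        issues.append("at most 4 inputs and 4 outputs are supported")
    if entry["task_type"] == "conditioned_cref" and entry["prompts"] not in CONTROL_PROMPTS:
        issues.append("prompt must be exactly one of: depth, canny, openpose")
    if entry["task_type"] == "sref" and n_in != 2:
        issues.append("sref expects 2 input images, with the style reference second")
    return issues

with open("assets/prompt.json") as f:
    for key, entry in json.load(f).items():
        for msg in check_entry(entry):
            print(f"[{key}] {msg}")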

💖 Acknowledgment

We sincerely thank the open-source community for providing strong foundations that enabled this work.
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:

  • HunyuanVideo-T2V, HunyuanVideo-I2V – Provided base generative model designs and code.
  • FastVideo – Contributed key components and open-source utilities that supported our development.

These contributions have greatly influenced our research and helped shape the design of iMontage.


📝 Citation

If you find iMontage useful for your research or applications, please consider starring ⭐ the repo and citing our paper:

@article{fu2025iMontage,
  title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation}, 
  author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
  journal={arXiv preprint arXiv:2511.20635},
  year={2025},   
}
