iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation

arXiv | Project Page | Demo | HuggingFace

iMontage Teaser

What if an image model could turn multiple images into a coherent, dynamic visual universe? 🤯 iMontage brings video-like motion priors to image generation, enabling rich transitions and consistent multi-image outputs, all from your own inputs. Try it out below and explore your imagination!

📦 Features

  • ⚡ Highly dynamic, highly consistent image generation from flexible inputs
  • 🎛️ Robust instruction following across heterogeneous tasks
  • 🌀 Video-like temporal coherence, even for non-video image sets
  • 🏆 State-of-the-art results across a broad range of tasks

📰 News

  • 2025.11.26 – The arXiv paper of iMontage is released.
  • 2025.11.26 – The inference code and model weights of iMontage are released.

🛠 Installation

1. Create virtual environment

conda create -n iMontage python=3.10
conda activate iMontage

# NOTE Choose torch version compatible with your CUDA
pip install torch==2.6.0+cu126 torchvision==0.21.0+cu126 torchaudio==2.6.0+cu126 --index-url https://download.pytorch.org/whl/cu126

# Install Flash Attention 2
# NOTE Also choose the correct version compatible with installed torch
pip install "flash-attn==2.7.4.post1" --no-build-isolation

Note: we train and evaluate our model with FlashAttention-3, so inference quality may be suboptimal with flash-attn 2.

If you are working on NVIDIA H100/H800 GPUs and want the best performance from our model, follow the official FlashAttention-3 guidance here, and then replace the corresponding code in fastvideo/models/flash_attn_no_pad.py.

After installing torch and flash-attn, install all other dependencies with:

pip install -e .
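To confirm the environment is set up correctly before downloading weights, you can run a quick check like the one below. This is a minimal sketch of our own, not part of the repository:

# Quick environment sanity check (hypothetical helper, not shipped with the repo)
import torch

print("torch:", torch.__version__)                # expect 2.6.0+cu126
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)  # expect 2.7.4.post1 (or FA3 if installed)
except ImportError:
    print("flash-attn is not installed")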

2. Download model weights

mkdir -p ckpts/hyvideo_ckpts

# Download hunyuan-video-i2v-720p; this may take 10 minutes to 1 hour depending on network conditions.
huggingface-cli download tencent/HunyuanVideo-I2V --local-dir ./ckpts/hyvideo_ckpts

# Downloading text_encoder from HunyuanVideo-T2V
huggingface-cli download xtuner/llava-llama-3-8b-v1_1-transformers --local-dir ./ckpts/llava-llama-3-8b-v1_1-transformers
python fastvideo/models/hyvideo/utils/preprocess_text_encoder_tokenizer_utils.py --input_dir ckpts/llava-llama-3-8b-v1_1-transformers --output_dir ckpts/hyvideo_ckpts/text_encoder

# Downloading text_encoder_2 from HunyuanVideo-I2V
huggingface-cli download openai/clip-vit-large-patch14 --local-dir ./ckpts/hyvideo_ckpts/text_encoder_2

mkdir ckpts/iMontage_ckpts
# Download the iMontage DiT weights; this may also take some time.
huggingface-cli download Kr1sJ/iMontage --local-dir ./ckpts/iMontage_ckpts
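If you prefer to script the downloads, the same repositories can be fetched from Python with huggingface_hub. This is a sketch equivalent to the commands above (assuming huggingface_hub is installed); the text_encoder preprocessing step is still required:

# Python alternative to huggingface-cli for the downloads above (sketch; same target paths)
from huggingface_hub import snapshot_download

snapshot_download(repo_id="tencent/HunyuanVideo-I2V", local_dir="./ckpts/hyvideo_ckpts")
snapshot_download(repo_id="xtuner/llava-llama-3-8b-v1_1-transformers",
                  local_dir="./ckpts/llava-llama-3-8b-v1_1-transformers")
snapshot_download(repo_id="openai/clip-vit-large-patch14",
                  local_dir="./ckpts/hyvideo_ckpts/text_encoder_2")
snapshot_download(repo_id="Kr1sJ/iMontage", local_dir="./ckpts/iMontage_ckpts")
# Still run preprocess_text_encoder_tokenizer_utils.py afterwards to build text_encoder.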

The final checkpoint directory structure should look like this:

iMontage
  ├──ckpts
  │  ├──hyvideo_ckpts
  │  │  ├──hunyuan-video-i2v-720p
  │  │  │  ├──transformers
  │  │  │  │  ├──mp_rank_00_model_states.pt
  │  │  │  ├──vae
  │  │  ├──text_encoder_i2v
  │  │  ├──text_encoder_2
  │  ├──iMontage_ckpts
  │  │  ├──diffusion_pytorch_model.safetensors
  │ ...
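A quick way to verify the layout is to check that the key files from the tree above exist. The snippet below is our own sketch; adjust the paths if your layout differs:

# Hypothetical check that the main checkpoint files are in place
from pathlib import Path

expected = [
    "ckpts/hyvideo_ckpts/hunyuan-video-i2v-720p/transformers/mp_rank_00_model_states.pt",
    "ckpts/iMontage_ckpts/diffusion_pytorch_model.safetensors",
]
for rel in expected:
    print(("ok      " if Path(rel).is_file() else "MISSING ") + rel)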

🚀 Inference

After setting up the environment and downloading the pretrained weights, you can start with our inference example. Note that the model currently supports at most 4 input images and at most 4 output images.

🔹 Example

Run the following command:

bash scripts/inference.sh

In this example, we run inference with:

--prompt assets/prompt.json

The JSON file contains six representative tasks:

  • Image editing
  • Character reference generation (CRef)
  • CRef + Vision signal
  • Style reference generation (SRef)
  • Multi-view generation
  • Storyboard generation

Each entry specifies the task type, instruction prompt, input reference images, output resolution, and desired number of generated frames. Running the script will automatically process all tasks in the JSON and save the results under the output directory.
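If you want to inspect or script over these tasks, the prompt file can be read directly. The loop below is a sketch of our own, based on the entry fields shown in the "Run your own job" subsection (task_type, prompts, images, height, width, output_num):

# Sketch: list the tasks defined in assets/prompt.json
import json

with open("assets/prompt.json") as f:
    tasks = json.load(f)

for key, entry in tasks.items():
    print(f"[{key}] {entry['task_type']}: "
          f"{len(entry['images'])} input(s) -> {entry['output_num']} output(s), "
          f"{entry['width']}x{entry['height']}")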

The expected results should be:

| Task Type | Input | Prompt | Output |
| --- | --- | --- | --- |
| image_editing | (image) | Change the material of the lava to silver. | (image) |
| cref | (image) | Confucius from the first image, Moses from the second… | (image) |
| conditioned_cref | (image) | depth | (image) |
| sref | (image) | (empty) | (image) |
| multiview | (image) | 1. Shift left; 2. Look up; 3. Zoom out. | (image) |
| storyboard | (image) | Vintage film: 1. Hepburn carrying the yellow bag… | (image) |

🔹 Run your own job

To run inference on your own images, create a JSON file in the same format as assets/prompt.json, with entries like this:

"0" :
    {
        "task_type": "image_editing",
        "prompts" : "Change the material of the lava to silver.",
        "images" : [
            "assets/images/llava.png"
        ],
        "height" : 416,
        "width" : 640,
        "output_num" : 1
    }
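Equivalently, you can write the file programmatically. The snippet below is a minimal sketch of our own (my_prompt.json is a hypothetical path) that produces one image_editing entry and can be passed to the inference script via --prompt:

# Sketch: build a single-entry prompt JSON for your own job (my_prompt.json is a placeholder name)
import json

entry = {
    "task_type": "image_editing",
    "prompts": "Change the material of the lava to silver.",
    "images": ["assets/images/llava.png"],
    "height": 416,
    "width": 640,
    "output_num": 1,
}

with open("my_prompt.json", "w") as f:
    json.dump({"0": entry}, f, indent=4)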

The instructions for all tasks are summarized below:

| Task Type | Description | Inputs | Notes / Tips |
| --- | --- | --- | --- |
| image_editing | Edit the input image according to the instruction (material, style, object change, etc.). | 1 image | The prompt should clearly describe what to change. Best to match the output size to the input image size. |
| cref | Generate an output using multiple character reference images. | ≥ 1 image | The order of reference images matters; the prompt should specify who comes from which image. Best results with 2–4 reference images. |
| conditioned_cref | Generate an output using multiple images and a vision-signal control map (depth, canny, openpose). | ≥ 1 image | Only depth, canny, and openpose are supported; the prompt should be exactly one of these three words. Put the control map first in the image list. |
| sref | Apply the style/features of the reference images to generate a new image. | 2 images | Leave the prompt empty if only using style; the model will infer the style from the input images. Put the style reference image in the second place. |
| multiview | Generate multiple viewpoints of the same scene. | 1 image | The prompt should contain step-by-step view changes (e.g., “move left”, “look up”, “zoom out”). output_num must match the number of described views. NOTE: results can be unsatisfying; try different prompts and seeds. |
| storyboard | Generate a sequence of frames forming a short story based on references. | ≥ 1 image | Prompts should be enumerated (1, 2, 3…) and start with a story-style word (vintage film, Japanese anime, etc.). Use reference images to anchor characters or props. Output resolution is often wider for a cinematic style. |
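Since each task has its own constraints, a small pre-flight check can save failed runs. The validator below is a sketch of our own (not part of the repository) that encodes a few of the rules from the table above, namely the 4-input/4-output limit, the control-map keywords, and the sref image count:

# Sketch: sanity-check prompt entries against the per-task rules above (not part of the repo)
import json

CONTROL_PROMPTS = {"depth", "canny", "openpose"}

def check_entry(entry):
    issues = []
    n_in, n_out = len(entry["images"]), entry["output_num"]
    if n_in > 4 or n_out > 4:
        issues.append("at most 4 inputs and 4 outputs are supported")
    if entry["task_type"] == "conditioned_cref" and entry["prompts"] not in CONTROL_PROMPTS:
        issues.append("prompt must be exactly one of: depth, canny, openpose")
    if entry["task_type"] == "sref" and n_in != 2:
        issues.append("sref expects 2 input images, with the style reference second")
    return issues

with open("assets/prompt.json") as f:
    for key, entry in json.load(f).items():
        for msg in check_entry(entry):
            print(f"[{key}] {msg}")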

💖 Acknowledgment

We sincerely thank the open-source community for providing strong foundations that enabled this work.
In particular, we acknowledge the following projects for their models, datasets, and valuable insights:

  • HunyuanVideo-T2V, HunyuanVideo-I2V – Provided base generative model designs and code.
  • FastVideo – Contributed key components and open-source utilities that supported our development.

These contributions have greatly influenced our research and helped shape the design of iMontage.


📝 Citation

If you find iMontage useful for your research or applications, please consider starring ⭐ the repo and citing our paper:

@article{fu2025iMontage,
  title={iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation}, 
  author={Zhoujie Fu and Xianfang Zeng and Jinghong Lan and Xinyao Liao and Cheng Chen and Junyi Chen and Jiacheng Wei and Wei Cheng and Shiyu Liu and Yunuo Chen and Gang Yu and Guosheng Lin},
  journal={arXiv preprint arXiv:2511.20635},
  year={2025},   
}
