UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization
Abstract
Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.
1 Introduction
The proliferation of text-to-image (T2I) generation models has unlocked remarkable capabilities in content creation. A key frontier in this field is personalization/customization: the ability to synthesize novel images featuring specific subjects, objects, or artistic styles provided by a user. This demand has spurred a rapid evolution of techniques. Early approaches, such as DreamBooth [28] and Textual Inversion [6], achieved high-fidelity personalization by fine-tuning a model on a few reference images. While effective, this tuning-based paradigm is computationally expensive and requires a distinct optimization process for each new visual concept.
To address this, a second wave of tuning-free methods emerged, including IP-Adapter [37], MIP-Adapter [11], PhotoMaker [14], and PuLID [8]. These frameworks inject visual features from a reference image directly into the diffusion process, enabling flexible, zero-shot personalization without any per-subject training. However, many of these methods are built upon U-Net [27] architectures with relatively weak text encoders. This can limit their ability to handle complex compositions and nuanced semantic control, especially when compared to more recent, large-scale architectures.
The development of Diffusion Transformers [19] (DiTs) marked a significant shift, offering superior scalability and a greater capacity for complex, multi-concept generation. This led to a new line of work, such as OmniGen [35], UNO [34], and DreamO [17], which leverage the transformer architecture to compose multiple distinct concepts within a single image. However, these unified transformer models often struggle with feature entanglement — where attributes from one subject leak into another — and with global feature injection, which degrades overall image quality.
To overcome this entanglement problem, recent work has focused on modulation-based approaches, such as TokenVerse [7], Mod-Adapter [40], and XVerse [2]. Instead of injecting broad visual features, these methods achieve finer control by modifying the text embedding signal itself. By calculating and applying modulation offsets to the text-conditioning stream, they can guide the generation process with high precision, preventing attribute leakage and enabling disentangled, high-quality personalization across multiple subjects.
However, these state-of-the-art modulation approaches introduce a critical new limitation. They predominantly require clean, pre-segmented reference images. This requirement severely curtails their practical utility, as real-world “in-the-wild” use cases are dominated by complex, unsegmented photos. This in-the-wild challenge highlights a crucial, yet under-explored, aspect of personalization: the need for a precise reference prompt to guide concept extraction. Previous works [38] have largely overlooked this, either attempting to identify the desired concept using a generic single-word label (e.g., “person”) or bypassing the ambiguity entirely with segmentation masks, which are often unavailable. Neither approach is sufficient. A simplistic prompt cannot disambiguate a specific subject from a cluttered background (e.g., “the man on the left” in a group photo), and segmentation fails for many abstract concepts users wish to personalize, such as artistic styles, specific textures, or material properties. This dependency on manual pre-processing or overly simplistic prompts prevents these powerful models from generalizing to the very unconstrained scenarios where personalization is most valuable.
To bridge this gap, we propose UniVerse, a Unified Modulation Framework for segmentation-free, in-the-wild subject-driven generation. Unlike prior work that relies on either visual-only conditioning or text-only modulation, or combines the two loosely, our core innovation is a single Reference Condition Extractor (RCE) that extracts both visual conditional latents (for appearance) and textual conditional offsets (for semantics). Crucially, both conditions are produced by the same module and are semantically aligned with the reference prompt, so they steer the generation process consistently. This dual-extraction pipeline allows our model to decompose complex reference images automatically, identifying and disentangling multiple concepts that range from distinct object identities to abstract artistic styles. These disentangled concepts can then be flexibly composed to synthesize new, complex scenes. Fig. 1 illustrates the decomposition ability of our approach on in-the-wild reference images, together with the photorealism of the generated images when concepts are placed in new contexts or combined.
Our contributions are: (i) the Reference Condition Extractor, the first module to extract semantically aligned visual and textual conditions jointly, guided by a reference prompt, enabling robust multi-concept decomposition and composition; (ii) a two-stage training strategy, in which the extractor is first supervised on a reference segmentation dataset, teaching it to localize concepts from a prompt, and is then jointly trained for the full generation task, resulting in improved generalization; (iii) UniVerseBench, a new benchmark of multi-concept reference images designed to evaluate prompt-guided concept decomposition by testing a model’s ability to disambiguate and extract the correct concept, which existing benchmarks do not adequately cover.
2 Related Work
Subject-driven generation. The main challenge in subject-driven generation is maintaining a subject’s visual identity while allowing flexible text-based editing. Early works achieve this through fine-tuning approaches, where pre-trained diffusion models are adapted to new subjects using a few reference images (e.g., DreamBooth [28], Textual Inversion [6]). Later methods introduce more efficient tuning-free strategies that inject subject features directly into the diffusion process without model retraining. Representative works include IP-Adapter [37], MIP-Adapter [11], SSR-Encoder [39], PhotoMaker [14], and PuLID [8], which employ pre-trained image encoders and cross-attention mechanisms to transfer visual representations, supporting zero-shot or low-resource personalization across subjects.
Multi-concept control and feature injection. Scaling personalization to multiple concepts or subjects introduces challenges of feature disentanglement and spatial control. A large body of work leverages attention-based conditioning to manage multiple subjects and modalities within a unified framework, such as OmniGen [35], OmniGen2 [32], DreamO [17], UNO [34], and MS-Diffusion [31]. Other methods focus on localized editing and grounding, providing fine-grained spatial control using auxiliary cues such as segmentation maps, bounding boxes, or depth information. Examples include Break-a-scene [1] and SeedEdit [30], which enable natural, in-the-wild customization with strong spatial grounding.
DiT-Based Modulation Approaches. The recent shift from UNet-based architectures to Diffusion Transformers (DiTs) [19] has enabled more structured and scalable conditioning mechanisms. In these models, Adaptive Layer Normalization (AdaLN) [36] provides a clean way to modulate the generation process via learned scale and bias terms. Building on this principle, frameworks such as TokenVerse [7] and XVerse [2] demonstrate that modulating the text-token space can effectively control multiple concepts without explicit masks or segmentation. These token-based modulation techniques offer fine-grained, disentangled personalization, paving the way for segmentation-free and highly controllable subject customization, which motivates the design of the proposed framework.
Table 1 compares the capabilities of our method, UniVerse, with those of existing personalized image generation models.
| Model | Concept Comp. | Concept Decomp. | Multiple Concepts | Abstract Concepts |
| XVerse [2] | ✓ | ✗ | ✓ | ✓ |
| UNO [34] | ✓ | ✗ | ✓ | ✓ |
| DreamO [17] | ✓ | ✗ | ✓ | ✓ |
| OmniGen [35] | ✓ | ✗ | ✓ | ✗ |
| MS-Diffusion [31] | ✓ | ✗ | ✗ | ✓ |
| UniVerse (Ours) | ✓ | ✓ | ✓ | ✓ |
3 Methodology
In this section, we first review the DiT modulation mechanism used for personalization/customization in previous works. We then present the UniVerse framework, which introduces a novel module to extract both visual and textual conditions. Finally, we describe the two-stage pipeline used to train it.


3.1 Preliminaries
Diffusion Transformers (DiTs) have become the foundational architecture for scalable image synthesis, replacing UNets [29, 25, 26, 22] in models such as Stable Diffusion 3 [5]. DiTs integrate conditioning information (e.g., the CLIP [23] text prompt embedding $\mathbf{p}$ and the timestep $t$) through a mechanism known as modulation. This is achieved through Adaptive Layer Normalization (AdaLN) [36], where a Multi-Layer Perceptron (MLP) processes the inputs to generate a conditioning vector $\mathbf{y}$:
$$\mathbf{y} = \mathrm{MLP}(\mathbf{p}, t) \tag{1}$$
This vector is then split into scale ($\gamma$) and shift ($\beta$) terms that dynamically modulate the network’s activations, integrating semantic control separately from the primary data flow. TokenVerse [7] pioneered injecting personalized identity features directly into this modulation pathway by learning a personalized vector offset ($\Delta_i$) per text token rather than using the same vector $\mathbf{y}$ to modulate all tokens. Building on this, XVerse [2] achieved tuning-free (zero-shot) subject-specific control by using a universal adapter to generate an offset $\Delta_i$ for the $i$-th token from its corresponding reference image $I_i$. The offset is added to the modulation vector as:
$$\mathbf{y}_i = \mathbf{y} + \Delta_i$$
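To make the modulation pathway concrete, the following minimal sketch applies one shared modulation vector to a short token sequence and then adds a per-token offset for a single personalized token. The function names, the toy dimensions, the dictionary of offsets, and the `(1 + scale)` convention are assumptions made for illustration; this is not the implementation used by TokenVerse, XVerse, or this work.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(token, y):
    """AdaLN-style modulation: y is split into a scale half and a shift half.

    Assumes the (1 + scale) convention; other conventions are possible.
    """
    d = len(token)
    scale, shift = y[:d], y[d:]
    normed = layer_norm(token)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

def modulate_tokens(tokens, y, offsets=None):
    """Modulate each token with y, plus an optional per-token offset.

    offsets maps a token index to an offset vector with the same length as y
    (a hypothetical container; a real adapter would predict these offsets).
    """
    offsets = offsets or {}
    out = []
    for i, tok in enumerate(tokens):
        delta = offsets.get(i)
        y_i = y if delta is None else [a + b for a, b in zip(y, delta)]
        out.append(modulate(tok, y_i))
    return out

# Toy example: 3 tokens of width 2, shared y of width 4 (scale | shift).
tokens = [[1.0, 3.0], [2.0, 0.0], [0.5, 1.5]]
y = [0.0, 0.0, 0.0, 0.0]           # zero scale and shift: plain layer norm
delta = {1: [0.0, 0.0, 1.0, 1.0]}  # shift only the second token by +1

base = modulate_tokens(tokens, y)
personalized = modulate_tokens(tokens, y, delta)
```

Only the token that carries an offset changes; the other tokens are modulated exactly as before, which is the property that lets per-token offsets personalize one concept without touching the rest of the prompt.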
3.2 UniVerse Framework
We improve the tuning-free approach to generating the per-token offset used for modulation in DiTs. Besides the reference image and its corresponding prompt token, our model accepts a reference prompt that describes the reference object in the context of the reference image. This prompt tells the model which concept to extract from the reference image. In some cases, the reference prompt differs from, or even contradicts, the description of the same token in the final generation prompt. For example, in Fig. 2, “sitting on the grass” describes the man in the reference image, whereas the final prompt has him “riding a horse”.
Our main pipeline, illustrated in Fig. 2, is built around the Reference Condition Extractor (RCE), which generates both a textual and a visual condition for the DiT during image generation. The textual condition takes the form of a modulation offset, while the visual condition is a set of reference latents appended to the latent inputs as additional context for denoising. The procedure for obtaining the two conditions is described in the following paragraphs.
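Visual Reference Latents. We leverage the CLIP [23] image and text encoders to extract reference-image and reference-prompt features, respectively. The visual features are then modulated by the textual features via Feature-wise Linear Modulation (FiLM) [21], which rescales and shifts each visual vector to suppress information unrelated to the prompt. Given the visual features $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_N] \in \mathbb{R}^{N \times d}$ and the textual features $\mathbf{c}$, where $N$ is the number of visual tokens and $d$ is the feature dimension, the modulated visual features $\hat{\mathbf{v}}_j$ are computed as:
$$\hat{\mathbf{v}}_j = f(\mathbf{c}) \odot \mathbf{v}_j + h(\mathbf{c}), \quad j = 1, \dots, N \tag{2}$$
This function modulates each vector $\mathbf{v}_j$ by a scale and a shift derived from the textual features through the functions $f$ and $h$. A subsequent MLP layer projects the modulated features into the DiT latent space, yielding the visual reference latents.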
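The FiLM step can be sketched in a few lines. In the example below, two hypothetical linear maps predict a per-channel scale and shift from a single text feature vector, and the same scale and shift are applied to every visual token. The helper names, the toy weights, and the use of plain linear maps are assumptions for illustration only, not the trained module described here.

```python
def linear(x, weight, bias):
    """Apply y = W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def film(visual_tokens, text_feature, scale_params, shift_params):
    """Feature-wise linear modulation of every visual token by one text feature.

    scale_params and shift_params are (weight, bias) pairs for two hypothetical
    linear maps that predict the per-channel scale and shift from the text.
    """
    scale = linear(text_feature, *scale_params)
    shift = linear(text_feature, *shift_params)
    return [[s * v + b for s, v, b in zip(scale, tok, shift)] for tok in visual_tokens]

# Toy setup: 2 visual tokens with 3 channels, a 2-dimensional text feature.
visual = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text = [1.0, 0.0]

# For this text, the scale map keeps channel 0, zeroes channel 1, doubles channel 2.
scale_params = ([[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]], [0.0, 0.0, 0.0])
# The shift map adds a constant 0.5 to every channel.
shift_params = ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.5, 0.5, 0.5])

modulated = film(visual, text, scale_params, shift_params)
```

A channel whose predicted scale is zero is wiped out for every token, which is one way to picture how text-conditioned modulation can suppress visual information that the reference prompt does not ask for.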
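Textual Reference Offset. Following XVerse [2], we inject visual features into the T5 [24] embedding of the prompt token via a Perceiver [13] layer. However, instead of using the CLIP image features directly, we use the FiLM-modulated visual features, from which non-essential information has been removed. We learn two modulation offsets for each reference token $i$: one shared across all blocks, $\Delta_i^{\mathrm{shared}}$, and one specific to each block $l$, $\Delta_i^{(l)}$. With $\mathbf{y}$ denoting the base modulation vector from Sec. 3.1, the final modulation vector for the $i$-th token at block $l$ is
$$\mathbf{y}_i^{(l)} = \mathbf{y} + \Delta_i^{\mathrm{shared}} + \Delta_i^{(l)}$$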
3.3 Two-stage Training Pipeline
Our proposed training approach is shown in Fig. 2 (b) and (c). There are two stages: We first pretrain the FiLM [21] module on a large-scale dataset and then finetune with other modules in the second stage on our multi-concept dataset.
In the first stage, we train the FiLM layer alone on top of the outputs of the CLIP encoders. Training is supervised by reference instance segmentation: an additional CLIPSeg decoder [16] is attached to predict a coarse segmentation mask conditioned on the text, and binary cross-entropy against the ground-truth mask is used as the loss. In this stage, only the FiLM layer and the CLIPSeg decoder are trained.
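As a rough illustration of the first-stage supervision signal, the sketch below computes a mean binary cross-entropy between predicted foreground probabilities and a binary mask, together with an intersection-over-union score of the kind commonly used to monitor segmentation quality. Masks are flattened to plain lists, and the function names, the clamping constant, and the 0.5 threshold are assumptions for illustration.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE between predicted foreground probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def iou(probs, targets, threshold=0.5):
    """Intersection over union after thresholding the predicted probabilities."""
    pred = [p >= threshold for p in probs]
    inter = sum(1 for p, t in zip(pred, targets) if p and t)
    union = sum(1 for p, t in zip(pred, targets) if p or t)
    return inter / union if union else 1.0

# Flattened 2x2 mask: only the first pixel belongs to the referred object.
probs = [0.9, 0.6, 0.2, 0.1]
targets = [1, 0, 0, 0]

loss = binary_cross_entropy(probs, targets)
score = iou(probs, targets)
```

In this toy case the second pixel is a confident false positive, so it dominates the loss and halves the IoU.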
In the second stage, we train the entire pipeline, including the Reference Condition Extractor (RCE) and DiT, on the reference-image-generation task. All encoder networks remain frozen, while the remaining components of the RCE are trainable. The FiLM module is initialized from the previous stage, while other layers are trained from scratch. We add low-rank parameters (LoRA [9]) to DiT and train them, while modulating their normalization parameters with learned offsets from RCE. The standard diffusion loss is used on the noise prediction.
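For readers unfamiliar with LoRA, the following minimal sketch shows the low-rank update on a single frozen linear layer: the frozen weight is left untouched, and a trainable product of two small matrices, scaled by `alpha / rank`, is added to its output. The matrices, the rank of 1, and the function names are toy assumptions, and the sketch omits the modulation offsets that are applied alongside LoRA in this work.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x, frozen_w, lora_a, lora_b, alpha=1.0):
    """Frozen linear layer plus a trainable low-rank update.

    frozen_w: (out, in) frozen weight; lora_a: (rank, in); lora_b: (out, rank).
    The update is scaled by alpha / rank, as in the LoRA paper.
    """
    rank = len(lora_a)
    base = matvec(frozen_w, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

frozen_w = [[1.0, 0.0], [0.0, 1.0]]  # identity layer, kept frozen
lora_a = [[1.0, 1.0]]                # rank-1 down-projection
lora_b_init = [[0.0], [0.0]]         # zero init: no change at the start
lora_b_trained = [[0.5], [-0.5]]     # hypothetical values after training
x = [2.0, 3.0]

before = lora_forward(x, frozen_w, lora_a, lora_b_init)
after = lora_forward(x, frozen_w, lora_a, lora_b_trained)
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the difference.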
4 Experiment
We conduct both quantitative and qualitative evaluations, demonstrating that UniVerse surpasses existing methods in accurately extracting multiple visual concepts from reference images and effectively integrating them to generate new, coherent images. We will open-source our code and pretrained model for reproducible research.
4.1 Implementation Details
Training Datasets. We train our model on publicly available datasets and on our own curated data. For the first stage, we pretrain our Reference Condition Extractor (RCE) on PhraseCut [33], a large-scale dataset for reference image segmentation. For the second stage, inspired by prior work [34, 2], we build our own multi-concept dataset using images from UNO-1M [34]. Because the number of multi-concept samples is limited, we also follow [2] in horizontally concatenating reference images, which helps the model learn to extract the correct concept. We call this augmentation technique Cross-Reference. Details of the dataset creation pipeline are available in the supplementary material.
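One plausible reading of the Cross-Reference augmentation is sketched below: a target reference image is placed side by side with a distractor, while the reference prompt continues to name only the target. The actual pipeline is described in the supplementary material, so the function names, the dictionary layout, the side-selection flag, and the equal-height requirement are all assumptions for illustration.

```python
def hconcat(images):
    """Concatenate images side by side; each image is a list of pixel rows."""
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must share the same height")
    return [[px for img in images for px in img[row]] for row in range(height)]

def cross_reference_sample(target, distractor, target_prompt, target_on_left=True):
    """Build one training sample whose reference contains two concepts.

    The reference prompt still names only the target, so the extractor has to
    ignore the distractor half of the combined image.
    """
    ordered = [target, distractor] if target_on_left else [distractor, target]
    return {"reference_image": hconcat(ordered), "reference_prompt": target_prompt}

# Toy 2x2 "images" with integer pixels standing in for real image arrays.
dog = [[1, 1], [1, 1]]
cat = [[2, 2], [2, 2]]

sample = cross_reference_sample(dog, cat, "the dog on the left")
swapped = cross_reference_sample(dog, cat, "the dog on the right", target_on_left=False)
```

Varying which side the target appears on keeps the model from learning a positional shortcut instead of following the reference prompt.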
Benchmarks. We evaluate our model and baselines on two public benchmarks: DreamBench++ [20] and XVerseBench [2]. While DreamBench++ is designed to evaluate single-concept personalization in text-to-image generation, XVerseBench extends the evaluation to multi-concept composition and fine-grained attribute control.
To further evaluate models’ ability to disentangle co-occurring visual concepts within the same reference images, we propose a new benchmark, UniVerseBench. The dataset for UniVerseBench consists of 20 reference images and 200 distinct prompts to evaluate single- and multi-subject image generation. Unlike previous benchmarks, UniVerseBench focuses on object decomposition from reference images. Each reference image consists of two co-occurring subjects, challenging models to extract the correct concept under these conditions.
| Method | Single-Subject DPG | Single-Subject ID-S | Single-Subject IP-S | Single-Subject AES | Single-Subject Avg | Multi-Subject DPG | Multi-Subject ID-S | Multi-Subject IP-S | Multi-Subject AES | Multi-Subject Avg | Overall |
| UNO [34] | 96.04 | 52.16 | 67.10 | 57.89 | 68.30 | 88.62 | 35.06 | 59.02 | 54.66 | 59.34 | 63.82 |
| DreamO [17] | 97.19 | 75.48 | 66.91 | 56.27 | 73.95 | 89.73 | 52.53 | 61.77 | 53.85 | 64.47 | 69.21 |
| OmniGen [35] | 90.61 | 76.63 | 69.86 | 54.70 | 72.95 | 87.67 | 74.34 | 57.18 | 53.45 | 68.16 | 70.56 |
| OmniGen2 [32] | 96.65 | 60.76 | 66.53 | 53.31 | 69.31 | 91.29 | 38.48 | 60.81 | 52.56 | 60.79 | 65.05 |
| MS-Diffusion [31] | 89.20 | 47.42 | 70.28 | 56.67 | 65.89 | 80.06 | 24.70 | 51.17 | 54.83 | 52.69 | 59.29 |
| MIP-Adapter [11] | 80.04 | 39.22 | 65.95 | 54.15 | 59.84 | 83.60 | 21.04 | 49.61 | 53.43 | 51.92 | 55.88 |
| XVerse [2] | 92.52 | 79.80 | 67.68 | 57.43 | 74.36 | 87.40 | 67.15 | 62.59 | 54.59 | 67.93 | 71.15 |
| UniVerse (Ours) | 91.93 | 82.77 | 75.88 | 55.86 | 78.14 | 87.95 | 66.69 | 71.60 | 54.44 | 70.18 | 74.16 |
Evaluation Metrics. We follow two evaluation protocols from previous benchmarks: VLM-as-a-judge and feature-based scores. The former is used in DreamBench++ and leverages GPT-4o [12] to grade (from 0 to 1) each generated image against its corresponding inputs. It evaluates models on Concept Preservation (CP), Prompt Fidelity (PF), and their product. The latter is used in XVerseBench and relies on pretrained models to compute the similarity between generated images and their inputs. Its metrics are the Dense Prompt Graph (DPG) score [10], which measures prompt alignment; Identity Similarity (ID-S) [3], which measures human identity preservation; Perceptual Similarity (IP-S), computed with DINOv2 [18], which measures object appearance consistency; and the Aesthetic Score (AES), computed with a SigLIP-based predictor [4], which measures overall aesthetic quality. In UniVerseBench, we use the IP-S and AES metrics to evaluate single- and multi-subject generation quality.
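The aggregate columns can be illustrated with a short worked example. The sketch below assumes that each Avg column is the unweighted mean of the four feature-based metrics and that Overall is the mean of the two Avg values; the text does not state this explicitly, but the assumption reproduces the XVerse row of Table 2. The helper names are hypothetical.

```python
def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

def cp_pf_product(cp, pf):
    """DreamBench++-style combined score: product of two grades in [0, 1]."""
    return cp * pf

# XVerse row from Table 2, in the order DPG, ID-S, IP-S, AES.
xverse_single = [92.52, 79.80, 67.68, 57.43]
xverse_multi = [87.40, 67.15, 62.59, 54.59]

single_avg = mean(xverse_single)
multi_avg = mean(xverse_multi)
overall = mean([single_avg, multi_avg])

# A hypothetical image graded 0.8 on concept preservation and 0.5 on prompt fidelity.
combined = cp_pf_product(0.8, 0.5)
```

Because the product penalizes a low score on either axis, a model cannot obtain a high combined score by copying the reference while ignoring the prompt, or the reverse.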
Baselines. We compare UniVerse against several state-of-the-art personalized image generation models: UNO [34], DreamO [17], OmniGen [35], OmniGen2 [32], MS-Diffusion [31], MIP-Adapter [11], and XVerse [2]. For all baselines, we use the default configurations from their respective code repositories, or the evaluation configurations specified in their respective papers. For consistency, we set all models to generate images at a target size of 768.
Model Architecture and Training Details. For our RCE, we use the pretrained CLIP-L/14-224 [23] to extract reference image and prompt features. The perceiver layer follows XVerse implementation. The MLP layer includes two linear layers with an activation in between and a layer norm at the end. For DiT, we use LoRA [9] with a rank of 128 to adapt to new conditions. The first stage consists of 10 epochs with a learning rate of and a cosine scheduler. We save the best epoch based on IoU on the validation set. For the second stage, we train for 150K iterations in total: the first 100K steps learn the shared offset, and the remaining 50K steps jointly train the block-wise adaptations. We use a learning rate of with the AdamW [15] optimizer and a batch size of 16 across 8 NVIDIA A100 GPUs.
4.2 Qualitative Results
We present qualitative results of UniVerse in Figures 3, 4, and 5. In single-subject settings (Figure 3), prior methods frequently exhibit concept leakage, incomplete attribute preservation, or identity drift, whereas UniVerse consistently extracts the intended concept and maintains subject fidelity—even in challenging cases requiring disentanglement of similar concepts.
In multi-subject scenarios (Figure 4), baseline models struggle with leakage, incorrect attribute transfer, and compositional failures, particularly when multiple references overlap in concepts. UniVerse reliably separates and preserves each concept, capturing fine-grained attributes and composing them coherently.
Figure 5 shows three UniVerse-generated images for each group of reference concepts. Across these examples, UniVerse reliably extracts the relevant visual elements from the reference images and recombines them in coherent new scenes. In addition to preserving the core subjects, UniVerse maintains subtle attributes like clothing details, accessories, and overall appearance cues, and can flexibly reimagine these elements within novel scene compositions.
UniVerse demonstrates robust identity-preserving composition in multi-human scenarios (Figure 6). Furthermore, our method disentangles and composes not only discrete objects but also abstract, non-object attributes such as pose and material (Figure 7). While UniVerse maintains high identity fidelity for up to six objects (Figure 8), it reaches compositional capacity limits as the object count increases further (7–9). In such high-density scenes, the model may omit objects or exhibit identity crosstalk.
4.3 Quantitative Results
On XVerseBench (Table 2), UniVerse achieves the highest average score in both single-subject and multi-subject generation. In single-subject evaluation, UniVerse obtains the best identity preservation (ID-S) and appearance similarity (IP-S), achieving an average score of 78.14 and outperforming the second-best model, XVerse, by over 3 points. In multi-subject scenarios, UniVerse shows strong cross-image compositional abilities, surpassing all baselines by over 2 points on average. Although single-subject generation is not the main focus of UniVerse, it performs strongly in that setting as well. We report results on DreamBench++ in the supplementary material.
On UniVerseBench (Table 3), UniVerse achieves the highest average score in both single- and multi-subject evaluations, driven mainly by the best IP-S in each setting, while its AES remains comparable to that of the baselines. These results suggest that UniVerse generalizes well across personalization scenarios, achieving state-of-the-art performance in compositional generation while preserving fine-grained visual fidelity and semantic coherence.
4.4 Ablation Studies
We conduct an ablation study to validate the contributions of the key components of UniVerse, with results summarized in Table 4. Our full model, listed as the baseline, achieves an average score of 48.64 on the multi-subject UniVerseBench. Removing the Reference Condition Extractor (RCE) pretraining stage lowers the average by 0.75 points, confirming its positive impact. Removing the Cross-Reference augmentation during training degrades performance similarly, by 0.74 points. Removing the visual reference latents during inference causes the largest decline, 1.50 points, driven mainly by a 2.14-point drop in IP-S. These results indicate that every ablated component contributes to the model’s overall effectiveness.
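The reported drops can be recomputed from the per-metric scores of the ablation table. The sketch below assumes that Avg is the unweighted mean of IP-S and AES; the dictionary layout and helper name are hypothetical, and small differences in the last digit can arise because the table rounds each average before subtracting.

```python
def avg(row):
    """Unweighted mean of the two reported metrics."""
    return (row["IP-S"] + row["AES"]) / 2

baseline = {"IP-S": 42.29, "AES": 54.98}
ablations = {
    "No RCE Pretraining": {"IP-S": 41.82, "AES": 53.96},
    "No Cross-Reference": {"IP-S": 41.95, "AES": 53.84},
    "No Visual Latents": {"IP-S": 40.15, "AES": 54.12},
}

# Change in average score relative to the full model (negative means worse).
deltas = {name: avg(row) - avg(baseline) for name, row in ablations.items()}

# The setting whose removal hurts the most.
largest_drop = min(deltas, key=deltas.get)
```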
| Method | Single-Subject IP-S | Single-Subject AES | Single-Subject Avg | Multi-Subject IP-S | Multi-Subject AES | Multi-Subject Avg | Overall |
| UNO [34] | 39.92 | 56.08 | 48.00 | 37.91 | 55.43 | 46.67 | 47.19 |
| DreamO [17] | 45.49 | 54.16 | 49.83 | 39.99 | 54.35 | 47.17 | 48.97 |
| OmniGen [35] | 47.72 | 52.64 | 50.18 | 41.75 | 54.67 | 48.21 | 49.53 |
| OmniGen2 [32] | 44.99 | 48.94 | 46.97 | 40.55 | 52.64 | 46.60 | 46.50 |
| MS-Diffusion [31] | 50.42 | 53.02 | 51.72 | 40.98 | 54.72 | 47.85 | 50.49 |
| MIP-Adapter [11] | 45.96 | 48.44 | 47.20 | 37.46 | 51.05 | 44.26 | 46.31 |
| XVerse [2] | 47.11 | 55.89 | 51.50 | 40.07 | 51.24 | 45.67 | 48.59 |
| UniVerse (Ours) | 51.49 | 54.62 | 53.06 | 42.29 | 54.98 | 48.64 | 51.05 |
| Settings | IP-S | AES | Avg | Δ Avg |
| Baseline | 42.29 | 54.98 | 48.64 | 0.00 |
| Different training strategy | ||||
| No RCE Pretraining | 41.82 | 53.96 | 47.89 | -0.75 |
| No Cross-Reference | 41.95 | 53.84 | 47.90 | -0.74 |
| Visual condition in personalization | ||||
| No Visual Latents | 40.15 | 54.12 | 47.14 | -1.50 |
5 Discussions
Our method has several limitations. First, a broader challenge in the field is a lack of a comprehensive segmentation-free benchmark for multi-reference generation; a future benchmark with richer reference sets (e.g., 3+ concepts each with multiple attributes) would enable more rigorous evaluation. Second, our model is not fully robust to concept interference (leakage), though restrictive prompts like “just the cat” help to mitigate this. Our method also occasionally overfits to a reference subject, and performance degrades when prompts are vague or nonsensical.
In this paper, we presented UniVerse, a unified modulation framework designed to address a critical limitation in personalized visual understanding: the inability to localize and disentangle concepts within multi-object scenes. Our approach successfully moves beyond the need for segmentation-based supervision, enabling robust, segmentation-free personalization within diffusion transformers. We demonstrated that UniVerse can not only customize generative outputs but also precisely localize target concepts, learning to compose complex scenes and decompose them into their constituent parts. Our extensive experiments show that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. By enabling decomposable concept extraction even in cluttered images, our work paves the way for more flexible, interpretable, and controllable personalized generation.
References
- [1] (2023) Break-a-scene: extracting multiple concepts from a single image. In SIGGRAPH Asia 2023 Conference Papers.
- [2] (2025) XVerse: consistent multi-subject control of identity and semantic attributes via dit modulation. arXiv preprint arXiv:2506.21416.
- [3] (2019) Arcface: additive angular margin loss for deep face recognition. In CVPR.
- [4] (2024) SigLIP-based aesthetic score predictor v2.5. https://github.com/discus0434/aesthetic-predictor-v2-5. GitHub repository, accessed 2025-11-13.
- [5] (2024) Scaling rectified flow transformers for high-resolution image synthesis. In ICML.
- [6] (2022) An image is worth one word: personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
- [7] (2025) Tokenverse: versatile multi-concept personalization in token modulation space. ACM Transactions On Graphics (TOG) 44 (4), pp. 1–11.
- [8] (2024) Pulid: pure and lightning id customization via contrastive alignment. In NeurIPS.
- [9] (2022) Lora: low-rank adaptation of large language models. In ICLR.
- [10] (2024) Ella: equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135.
- [11] (2025) Resolving multi-condition confusion for finetuning-free personalized image generation. In Proceedings of the AAAI Conference on Artificial Intelligence.
- [12] (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- [13] (2021) Perceiver: general perception with iterative attention. In ICML.
- [14] (2024) Photomaker: customizing realistic human photos via stacked id embedding. In CVPR.
- [15] (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [16] (2022) Image segmentation using text and image prompts. In CVPR.
- [17] (2025) Dreamo: a unified framework for image customization. arXiv preprint arXiv:2504.16915.
- [18] (2023) Dinov2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- [19] (2023) Scalable diffusion models with transformers. In ICCV.
- [20] (2024) Dreambench++: a human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855.
- [21] (2018) Film: visual reasoning with a general conditioning layer. In AAAI.
- [22] (2023) Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
- [23] (2021) Learning transferable visual models from natural language supervision. In ICML.
- [24] (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21 (140), pp. 1–67.
- [25] (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
- [26] (2022) High-resolution image synthesis with latent diffusion models. In CVPR.
- [27] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention.
- [28] (2023) Dreambooth: fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.
- [29] (2022) Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS.
- [30] (2024) Seededit: align image re-generation to image editing. arXiv preprint arXiv:2411.06686.
- [31] (2024) Ms-diffusion: multi-subject zero-shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209.
- [32] (2025) OmniGen2: exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871.
- [33] (2020) Phrasecut: language-based image segmentation in the wild. In CVPR.
- [34] (2025) Less-to-more generalization: unlocking more controllability by in-context generation. arXiv preprint arXiv:2504.02160.
- [35] (2025) Omnigen: unified image generation. In CVPR.
- [36] (2019) Understanding and improving layer normalization. In NeurIPS.
- [37] (2023) Ip-adapter: text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
- [38] (2025) A survey on personalized content synthesis with diffusion models. Machine Intelligence Research 22 (5), pp. 817–848.
- [39] (2024) Ssr-encoder: encoding selective subject representation for subject-driven generation. In CVPR.
- [40] (2025) Mod-adapter: tuning-free and versatile multi-concept personalization via modulation adapter. arXiv preprint arXiv:2505.18612.