pith. machine review for the scientific record.

arxiv: 2605.12088 · v2 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Fuli Feng, Kun Gai, Pengfei Wan, Qiulin Wang, Wenjie Wang, Xintao Wang, Yiyan Xu, Yunyao Mao

Pith reviewed 2026-05-13 05:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-reference image generation · visual conditioning · feature fusion · diffusion models · subject consistency · VLM · image synthesis

The pith

Fusing semantic ViT features with appearance-rich VAE features early, before VLM encoding, lets the model more reliably associate each reference subject with its specific visual details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that separating semantic processing from appearance details in existing VLM-enhanced diffusion models creates a mismatch where the system knows which subject is intended but cannot reliably transfer its identity and fine details. By merging the two feature types with a simple linear layer before they reach the VLM, the hidden states can carry both the high-level reference and the low-level appearance together from the start. The method uses reconstruction pretraining to keep details intact, followed by generation finetuning and a slot-wise regularization term that discourages mixing features across references. If the approach works, multi-reference image generation would produce outputs that more faithfully match text instructions to the correct reference appearances without leakage or confusion. Readers would care because current methods often degrade quickly once more than one reference image is supplied, restricting reliable use in customized scene creation.

Core claim

UniCustom proposes that early fusion of ViT semantic features and VAE appearance features before VLM encoding, implemented through a lightweight linear layer, produces hidden states that jointly represent each referred subject and its corresponding visual details. This unified conditioning is learned via a two-stage process: reconstruction-oriented pretraining to retain reference-specific appearance, followed by supervised finetuning on single- and multi-reference tasks, together with a slot-wise binding regularization that limits cross-reference entanglement.

What carries the argument

Early fusion of ViT and VAE features before VLM encoding via a lightweight linear layer, which creates unified hidden states that carry both semantics and appearance for each reference image.
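
As a concrete illustration, a minimal sketch of such a fusion layer is given below, assuming PyTorch, per-token concatenation, and arbitrary feature dimensions; none of these specifics come from the paper.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Hypothetical sketch of early ViT+VAE fusion before the VLM.
    Dimensions and token alignment are editorial assumptions, not the
    paper's implementation."""

    def __init__(self, vit_dim: int = 1024, vae_dim: int = 512, vlm_dim: int = 2048):
        super().__init__()
        # A single lightweight projection from the concatenated streams
        # into the VLM's embedding space.
        self.proj = nn.Linear(vit_dim + vae_dim, vlm_dim)

    def forward(self, vit_feats: torch.Tensor, vae_feats: torch.Tensor) -> torch.Tensor:
        # vit_feats: (batch, tokens, vit_dim); vae_feats: (batch, tokens, vae_dim).
        # Assumes both streams are resampled to the same token grid.
        fused = torch.cat([vit_feats, vae_feats], dim=-1)
        return self.proj(fused)  # (batch, tokens, vlm_dim), fed to the VLM

# One reference image encoded as 256 tokens per stream.
fusion = EarlyFusion()
tokens = fusion(torch.randn(1, 256, 1024), torch.randn(1, 256, 512))
print(tokens.shape)  # torch.Size([1, 256, 2048])
```

The point of the ordering is that fusion happens before the VLM reads the tokens, so every downstream hidden state already mixes semantics with appearance.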

If this is right

  • Each reference image's identity and fine-grained appearance remain more consistently preserved across generated outputs.
  • Text instructions that assign specific roles to different references are followed with higher accuracy.
  • Attribute leakage and unintended mixing of details between references decrease in complex multi-subject scenes.
  • Overall compositional fidelity improves because the model maintains clearer bindings between textual descriptions and visual references.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same early-fusion pattern could be tested in video generation pipelines where temporal consistency across multiple reference subjects is required.
  • Replacing the linear fusion layer with a small learned network might further strengthen detail preservation while keeping the overall training budget low.
  • The slot-wise regularization term could be examined on datasets with larger numbers of simultaneous references to check whether binding quality scales.

Load-bearing premise

That early fusion of ViT and VAE features inside the VLM hidden states will simultaneously preserve semantic grounding and low-level appearance details without introducing new entanglement or training instability.

What would settle it

Training and evaluating the unified model on the two multi-reference generation benchmarks (OmniContext and MICo-Bench) and observing no measurable gains in subject consistency, instruction following, or compositional fidelity over the decoupled ViT-plus-VAE baselines would falsify the claimed benefit of early fusion.
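
The benchmarks' official metrics are not reproduced on this page. As a rough stand-in for how subject consistency is commonly scored, a CLIP image-embedding similarity (the familiar CLIP-I-style check) between each reference and the generated output could be computed as follows; the model choice and helper name are illustrative assumptions, not the benchmarks' scorers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative stand-in metric, not the benchmarks' official scorer.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subject_consistency(reference: Image.Image, generated: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of a reference
    image and a generated image; higher suggests better identity
    preservation for that subject."""
    inputs = processor(images=[reference, generated], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float(emb[0] @ emb[1])

# Usage: subject_consistency(Image.open("ref.png"), Image.open("gen.png"))
```

A falsification run in the spirit above would compare such per-reference scores for UniCustom against the decoupled baselines on both benchmarks.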

Figures

Figures reproduced from arXiv: 2605.12088 by Fuli Feng, Kun Gai, Pengfei Wan, Qiulin Wang, Wenjie Wang, Xintao Wang, Yiyan Xu, Yunyao Mao.

Figure 1. Illustration of decoupled and unified visual conditioning.
Figure 2. Overview of UniCustom. UniCustom fuses ViT and VAE features before VLM encoding, producing semantically addressable and appearance-aware hidden states for DiT generation.
Figure 3. Two-stage training strategy. The first stage progressively learns a unified visual representation that supports fine-grained reference encoding, semantic grounding, and reliable textual-to-visual binding through reconstruction-oriented multi-image pretraining. The second stage further adapts the diffusion backbone to reference-based image generation, enabling instruction-following synthesis with single or multiple references.
Figure 4. Qualitative comparison on OmniContext [36].
Figure 5. Qualitative comparison on MICo-Bench [34].
Figure 6. Attention visualization of UniCustom.
Figure 7. Effect of slot-wise binding regularization, where "Recon." and "Local." denote "Reconstruction" and "Localization", respectively.
Figure 8. Effect of different fusion strategies, where "Recon." and "Local." denote "Reconstruction" and "Localization", respectively.
Figure 9. More generated examples of UniCustom on OmniContext [36].
Figure 10. More generated examples of UniCustom on MICo-Bench [34].
Figure 11. Generated examples of UniCustom on image localization.
Figure 12. Generated examples of UniCustom on image editing.
Figure 13. Generated examples of UniCustom on text-to-image generation.
Original abstract

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual conditioning framework that fuses ViT and VAE features before VLM encoding. This early fusion exposes the VLM to both semantic cues and appearance-rich details, enabling its hidden states to jointly encode the referred subject and corresponding visual appearance with only a lightweight linear fusion layer. To learn such unified representations, we adopt a two-stage training strategy: reconstruction-oriented pretraining that preserves reference-specific appearance details in the fused hidden states, followed by supervised finetuning on single- and multi-reference generation tasks. We further introduce a slot-wise binding regularization that encourages each image slot to preserve low-level details of its corresponding reference, thereby reducing cross-reference entanglement. Experiments on two multi-reference generation benchmarks demonstrate that UniCustom consistently improves subject consistency, instruction following, and compositional fidelity over strong baselines.
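
For readers who want the recipe at a glance, the following is a hypothetical sketch of the two-stage training and the slot-wise binding term described in the abstract. The slot decoder, the MSE form of the loss, and the weight lambda_bind are editorial assumptions; the paper's actual equations are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the paper's components.
fusion = nn.Linear(1024 + 512, 2048)   # early ViT+VAE fusion layer
slot_decoder = nn.Linear(2048, 512)    # reads a slot's hidden states back toward VAE space
lambda_bind = 0.1                      # regularization strength (a fitted hyperparameter)

def slot_binding_loss(hidden_slots, vae_targets):
    """Assumed form of the slot-wise binding regularizer: each image
    slot must reconstruct only its own reference's appearance features,
    discouraging detail mixing across references."""
    per_slot = [F.mse_loss(slot_decoder(h), a) for h, a in zip(hidden_slots, vae_targets)]
    return torch.stack(per_slot).mean()

# Three reference images, each encoded as 256 ViT tokens and 256 VAE tokens.
refs_vit = [torch.randn(256, 1024) for _ in range(3)]
refs_vae = [torch.randn(256, 512) for _ in range(3)]
slots = [fusion(torch.cat([v, a], dim=-1)) for v, a in zip(refs_vit, refs_vae)]

# Stage 1 (sketch): reconstruction-oriented pretraining drives this loss down.
recon_loss = slot_binding_loss(slots, refs_vae)

# Stage 2 (sketch): generation finetuning would add a diffusion objective,
# keeping the binding term as a regularizer, e.g.
#   total = diffusion_loss + lambda_bind * slot_binding_loss(slots, refs_vae)
print(float(recon_loss))
```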

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes UniCustom for multi-reference image generation, which fuses ViT semantic features and VAE appearance features early via a lightweight linear layer before VLM encoding. It employs a two-stage training procedure (reconstruction-oriented pretraining followed by supervised finetuning on single- and multi-reference tasks) plus slot-wise binding regularization to preserve low-level details and reduce cross-reference entanglement. The central claim, supported by experiments on two benchmarks, is that this unified conditioning yields consistent gains in subject consistency, instruction following, and compositional fidelity over strong baselines.

Significance. If the gains can be rigorously attributed to the early fusion mechanism rather than the auxiliary training recipe, the work would meaningfully advance VLM-enhanced diffusion models by addressing the semantic-appearance decoupling problem that leads to attribute leakage and confusion in multi-reference settings. The lightweight fusion and practical two-stage strategy are strengths that could be adopted more broadly if validated.

major comments (2)
  1. [Experiments] Experiments section: the comparisons to baselines do not describe whether those baselines were retrained with the identical two-stage procedure and slot-wise binding regularization; without such matched controls, the reported improvements in subject consistency and compositional fidelity cannot be confidently attributed to the early ViT-VAE fusion rather than the training and regularization components.
  2. [Method] Method section (description of unified conditioning): the claim that early fusion via the linear layer 'enables its hidden states to jointly encode the referred subject and corresponding visual appearance' lacks supporting analysis (e.g., feature visualization or entanglement metrics) showing that semantic grounding is preserved without introducing new training instability or detail loss, which is load-bearing for the weakest assumption identified in the approach.
minor comments (2)
  1. [Abstract] Abstract: the specific names of the two multi-reference generation benchmarks and the quantitative metrics (e.g., subject consistency scores) should be stated to allow immediate assessment of the strength of the claims.
  2. [Method] The notation for the linear fusion layer and the slot-wise regularization term should be formalized with equations for reproducibility.
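
For concreteness, one plausible formalization of these two components, in the editor's notation rather than the paper's, might read:

```latex
% Hypothetical notation: v_i, a_i are the ViT and VAE tokens of reference i,
% u_i the fused tokens fed to the VLM, h_i the VLM hidden states of image
% slot i, g a light decoder, K the number of references, and \lambda the
% fitted regularization strength.
\begin{align}
  u_i &= W\,[\,v_i \,;\, a_i\,] + b,
       \qquad W \in \mathbb{R}^{d \times (d_v + d_a)} \\
  \mathcal{L}_{\mathrm{bind}}
       &= \frac{1}{K} \sum_{i=1}^{K} \bigl\lVert g(h_i) - a_i \bigr\rVert_2^2 \\
  \mathcal{L} &= \mathcal{L}_{\mathrm{gen}} + \lambda\, \mathcal{L}_{\mathrm{bind}}
\end{align}
```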

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and rigor where appropriate.

Point-by-point responses
  1. Referee: Experiments section: the comparisons to baselines do not describe whether those baselines were retrained with the identical two-stage procedure and slot-wise binding regularization; without such matched controls, the reported improvements in subject consistency and compositional fidelity cannot be confidently attributed to the early ViT-VAE fusion rather than the training and regularization components.

    Authors: We agree that matched controls would provide stronger evidence for attributing gains specifically to early fusion. The original experiments compared against published baseline implementations to reflect standard practice. The two-stage training and slot-wise regularization are integral to enabling effective unified conditioning in UniCustom. In the revision we will add a controlled ablation applying the same training recipe to at least one strong baseline (where architecture permits) and report the resulting metrics to better isolate the contribution of the fusion layer. revision: yes

  2. Referee: Method section (description of unified conditioning): the claim that early fusion via the linear layer 'enables its hidden states to jointly encode the referred subject and corresponding visual appearance' lacks supporting analysis (e.g., feature visualization or entanglement metrics) showing that semantic grounding is preserved without introducing new training instability or detail loss, which is load-bearing for the weakest assumption identified in the approach.

    Authors: We acknowledge that direct supporting analysis would strengthen the claim. While end-to-end benchmark gains provide indirect evidence that semantics and appearance are jointly encoded without catastrophic loss, we will add feature visualizations (e.g., t-SNE of fused vs. original ViT/VAE states) and simple quantitative checks (cosine similarity to reference features and training-loss stability curves) in the revised method section to demonstrate preservation of grounding and absence of new instability. revision: yes
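
To make the proposed cosine-similarity check concrete, a minimal diagnostic sketch follows; the least-squares read-out, the shapes, and the function name are the editor's assumptions rather than the authors' protocol.

```python
import torch
import torch.nn.functional as F

def stream_preservation(fused: torch.Tensor, vit: torch.Tensor, vae: torch.Tensor):
    """Hypothetical diagnostic: fit a least-squares read-out from the
    fused hidden states back to each original stream, then report mean
    cosine similarity. High scores for both streams would suggest that
    semantics and appearance survive fusion."""
    scores = {}
    for name, target in [("vit", vit), ("vae", vae)]:
        W = torch.linalg.lstsq(fused, target).solution  # (d_fused, d_stream)
        recon = fused @ W
        scores[name] = F.cosine_similarity(recon, target, dim=-1).mean().item()
    return scores

# Toy call with random tensors; real use would pass actual encoder outputs.
print(stream_preservation(torch.randn(4096, 2048),
                          torch.randn(4096, 1024),
                          torch.randn(4096, 512)))
```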

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarks and standard training steps

full rationale

The paper introduces an architectural modification (early ViT+VAE fusion via linear layer) plus a two-stage training recipe and slot-wise regularization, then reports empirical gains on external multi-reference benchmarks. No equations, predictions, or first-principles derivations are presented that reduce the claimed improvements to quantities defined by the same fitted parameters or self-citations. The training procedure is described as standard reconstruction pretraining followed by supervised fine-tuning; the central result is an experimental comparison rather than a closed-form identity or self-referential fit. This is a self-contained design-and-evaluation paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the standard assumption that diffusion models and VLMs can be conditioned via cross-attention or feature injection; the only added elements are a lightweight linear fusion layer and a regularization term whose strength is chosen during training.

free parameters (2)
  • linear fusion layer weights
    Trained parameters that combine ViT and VAE features; their values are learned rather than derived.
  • slot-wise binding regularization strength
    Hyperparameter controlling how strongly each slot must preserve reference details; fitted during training.
axioms (2)
  • domain assumption: Diffusion models can be conditioned on fused visual features without loss of semantic or appearance fidelity.
    Invoked when claiming that early fusion enables joint encoding of subject identity and visual details.
  • domain assumption: Two-stage training (reconstruction then generation) converges to better multi-reference performance than joint training.
    Stated as the adopted strategy without proof that alternatives would fail.

pith-pipeline@v0.9.0 · 5589 in / 1445 out tokens · 32338 ms · 2026-05-13T05:51:05.454392+00:00 · methodology
