i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Boya Zeng; Gabriel Sarch; Jucheng Shen; Shu Pu; Taiming Lu; Tianze Luo; Zhuang Liu

arxiv: 2606.11289 · v1 · pith:WKGLRSTCnew · submitted 2026-06-09 · 💻 cs.CV

Pith reviewed 2026-06-27 13:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-image generation · diffusion models · open models · model training · public datasets · benchmarks · ablation studies

The pith

A fully open 3B-parameter text-to-image diffusion model matches leading closed models on five benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors run more than 300 controlled experiments totaling over 700,000 TPU hours to test modeling and data choices in text-to-image diffusion training. They identify simple effective practices such as equal weighting when mixing curated public datasets and using larger adapters on the text encoder. Applying these practices, they train i1, a 3B-parameter model on only publicly available data, which reaches performance competitive with leading closed models on GenEval, DPG, PRISM, CVTG-2K, and LongText while beating the strongest prior fully open model by 29.5 percentage points on average. The work releases the model weights, training and inference code, and data pipeline to serve as a foundation for further open research.

Core claim

Systematic ablations reveal that equal weighting of curated datasets and modestly larger text-encoder adapters are strong defaults; when these and other simple choices are applied to train a 3B-parameter diffusion model on public data alone, the resulting i1 model matches closed leaders and exceeds the best previous fully open model by 29.5 absolute percentage points across five representative benchmarks.
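
The headline number is a gap between two benchmark averages, stated in absolute percentage points. A minimal sketch of that arithmetic follows, assuming every benchmark score is already on a 0-100 scale and that the average is an unweighted mean over the five suites. The scores for `model_a` and `model_b` are made-up placeholders, not results from the paper.

```python
from statistics import mean

BENCHMARKS = ["GenEval", "DPG", "PRISM", "CVTG-2K", "LongText"]


def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark percentage scores (0-100 scale)."""
    return mean(scores[name] for name in BENCHMARKS)


def gap_in_points(model: dict[str, float], baseline: dict[str, float]) -> float:
    """Absolute percentage-point gap between two averaged scores."""
    return average_score(model) - average_score(baseline)


# Placeholder numbers for illustration only; NOT taken from the paper.
model_a = {"GenEval": 80.0, "DPG": 85.0, "PRISM": 70.0, "CVTG-2K": 75.0, "LongText": 90.0}
model_b = {"GenEval": 60.0, "DPG": 80.0, "PRISM": 50.0, "CVTG-2K": 40.0, "LongText": 20.0}

print(f"gap: {gap_in_points(model_a, model_b):.1f} pp")  # gap: 30.0 pp
```

An absolute gap of this kind is a difference of averages, so a single benchmark with a very large difference can dominate it; reporting the per-benchmark gaps alongside the mean avoids that ambiguity.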

What carries the argument

The set of empirical findings from 300+ ablations on dataset mixing, text encoder adapters, and related design decisions that together form the i1 training recipe.

If this is right

Fully open text-to-image models can now reach performance levels previously limited to closed systems using only public resources.
The released checkpoints, code, and data pipeline enable direct replication and extension by any researcher.
Simple, non-proprietary design choices can close most of the gap to state-of-the-art performance.
Future open research in diffusion-based generation can start from a documented, high-performing baseline rather than from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Releasing the full training pipeline may allow the community to test the same recipe on new public datasets or architectures.
The emphasis on equal weighting suggests that careful curation alone, without proprietary weighting schemes, can be sufficient for strong results.
If the model generalizes beyond the five benchmarks, it could serve as a testbed for studying failure modes that closed models hide.

Load-bearing premise

The five chosen benchmarks adequately measure overall text-to-image quality without favoring the modeling or data decisions tested in the experiments.

What would settle it

A new benchmark or large-scale human study in which i1 scores substantially below leading closed models while prior open models remain unchanged would undermine the claim of competitiveness.

Figures

Figures reproduced from arXiv: 2606.11289 by Boya Zeng, Gabriel Sarch, Jucheng Shen, Shu Pu, Taiming Lu, Tianze Luo, Zhuang Liu.

**Figure 1.** We investigate the design space of text-to-image diffusion models to understand how modeling and data choices affect model capabilities. This exploration culminates in i1, a 3B-parameter model that performs competitively with leading models at 1024-resolution, as measured by the average percentage score across GenEval, DPG-Bench, PRISM, CVTG-2K, and LongText-Bench. We open-source our model, code, and data …

**Figure 2.** Curated showcase of i1 in general image generation (more examples in Appendix B.1).

**Figure 3.** Curated showcase of i1 in text-rendering (more examples in Appendix B.1).

**Figure 4.** High-level illustration of our final i1 model.

**Figure 5.** High-level illustration of our baseline for controlled experiments.

**Figure 6.** Example images from each image dataset (more in Appendix E.1). We use 12 curated image datasets for our controlled experiments, including 7 real-image datasets, 3 synthetic datasets, and 2 text-rendering datasets.

**Figure 7.** Example prompts from benchmarks used in our controlled experiments.

**Figure 8.** Text encoders’ performance across benchmarks. Under our modeling setup, the encoder-decoder T5Gemma models outperform representative decoder-only LLM/VLMs and CLIP-style models. More results in Appendix C. We observe that instruction tuning has minimal impact (e.g., T5Gemma-2B vs. T5Gemma-2B (base)) and larger models do not necessarily perform better (e.g., T5Gemma-2B vs. T5Gemma-9B). Most importantly, enc…

**Figure 9.** The two MLPs learn different features. We obtain two sets of features for each prompt using the two MLPs, compute cosine similarity between each pair of token-level feature vectors, and visualize the distribution of mean similarity across tokens per prompt. … beyond two transformer blocks yields only marginal additional gains.

**Figure 10.** Using larger adapters for the text encoder consistently improves performance.

**Figure 11.** Long skip connections (Bao et al., 2023) can improve the performance-parameter trade-off for dual-stream models. Additional FLOPs-based analysis is in …

**Figure 12.** Backbone family. We compare cross-attention, single-stream, and dual-stream backbones across model sizes (see …

**Figure 13.** The choice of synthetic captioner is important for downstream text-to-image performance.

**Figure 14.** Sequence length of ImageNet-22K captions (10K random subset) and original, repeated, and rewritten GenEval prompts under T5Gemma tokenizer. While repeating the short prompts can recover the performance, it introduces unnatural prompt structures. To address this issue, we instead use an LLM (Qwen3-4B) to rewrite the GenEval prompts using the following meta-prompt: “I have a short text-to-image prompt {pro…

**Figure 15.** Examples from models trained on ImageNet-22K caption variants and tested on GenEval prompt variants.

**Figure 16.** Benchmark performance for single-dataset training.

**Figure 17.** Real, synthetic, and text-rendering images are all important for model performance.

**Figure 18.** Threshold-based weighting. By default, the sampling weight of a dataset is its number of images. We explore dataset-level balancing by capping the sampling weights for all datasets at four hand-picked thresholds. We find that lower thresholds (i.e., more even weights) generally lead to stronger performance. Given the effectiveness of equal dataset weighting (see …

**Figure 19.** Performance change from upweighting a single dataset.

**Figure 20.** Subsampling the ImageNet-22K dataset has little effect on performance.

**Figure 21.** The architecture of our final i1 model. Building on an MMDiT backbone, we use a large text encoder adapter consisting of 2 transformer blocks, remove noise-conditioning (i.e., AdaLN), add long skip connections, combine both sinusoidal and RoPE positional embeddings, and share sandwich normalizations across text and image streams. In the previous sections, we explored the modeling and data designs that can…

**Figure 22.** Benchmark performance of i1 during 256-resolution pre-training.

**Figure 23.** Example generated images at different iterations of 256-resolution training.

**Figure 24.** Benchmark performance of i1 at 512-resolution.

**Figure 25.** Text rendering improves substantially after 512-resolution training.

**Figure 26.** Benchmark performance of i1 during high-resolution training stages.

**Figure 27.** The architecture of our cross-attention baseline model.

**Figure 28.** The architecture of the single-stream variant of our baseline model.

**Figure 29.** The architecture of the dual-stream variant of our baseline model.

**Figure 30.** Qualitative comparison with Stable Diffusion 3 Medium.

**Figure 31.** Qualitative comparison with Stable Diffusion 3 Medium.

**Figure 33.** Visual quality degrades gracefully as the number of inference steps decreases.

**Figure 32.** Effect of the number of inference steps on model performance.

**Figure 34.** Examples of i1’s generation failures. The left image shows a group scene in which i1 fails to generate human faces and hands with high fidelity. The right image shows a case in which i1 fails to respect the physical behavior of a mirror, producing an implausible reflection.

**Figure 35.** Prompt length distributions of the five benchmarks.

**Figure 36.** Combining both positional embeddings results in superior performance.

**Figure 37.** Sandwich normalization improves performance.

**Figure 38.** Qualitative examples of VAE reconstructions on text-rich images.

**Figure 39.** Text encoder performance with a larger adapter.

**Figure 40.** Text encoder performance when AdaLN is removed.

**Figure 41.** Backbone family. We compare cross-attention, single-stream, and dual-stream backbones across estimated training FLOPs for trainable modules. Consistent with …

**Figure 42.** Long skip connections (Bao et al., 2023) improve benchmark performance across backbones. This further supports our observations on dual-stream backbones across model sizes in …

**Figure 43.** Ablating modeling designs on the dual-stream backbone across model sizes.

**Figure 44.** Ablating modeling designs on the dual-stream backbone across trainable model FLOPs.

**Figure 45.** Ablating the range of layers to which long skip connections are applied.

**Figure 46.** Performance change from upweighting a single dataset.

**Figure 47.** Random samples of images from each training dataset.

**Figure 48.** ImageNet-22K caption length distributions with different captioners.

**Figure 49.** Caption sequence length across datasets for 10K random samples per dataset using the T5Gemma tokenizer.

E.5 Meta-Prompt for Synthetic Captioning. For most image datasets, we use the following minimal prompt for synthetic captioning: “Describe the image in detail using one paragraph.” For text-rendering datasets, however, ground-truth text annotations are available. To reduce hallucinations in the VLM-genera…
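
The caption of Figure 9 describes a per-prompt statistic: cosine similarity between token-level feature vectors from the two MLPs, averaged over tokens. The sketch below is one plausible reading of that description, assuming "each pair" means the two feature vectors produced for the same token position. The function names and the toy vectors are hypothetical and are not taken from the released code.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_token_similarity(feats_a: list[list[float]], feats_b: list[list[float]]) -> float:
    """Mean over token positions of the cosine similarity between the two
    feature vectors that the two MLPs produce for the same token."""
    sims = [cosine(u, v) for u, v in zip(feats_a, feats_b)]
    return sum(sims) / len(sims)


# Toy two-token prompt with 2-d features (illustrative values only).
mlp1_feats = [[1.0, 0.0], [0.0, 1.0]]
mlp2_feats = [[1.0, 0.0], [1.0, 0.0]]

print(mean_token_similarity(mlp1_feats, mlp2_feats))  # 0.5
```

Collecting this value for every prompt and plotting a histogram would give the kind of distribution the caption refers to; values well below 1 indicate that the two MLPs encode the same tokens differently.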

Original abstract

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a released 3B open diffusion model plus its training recipe, which beats prior open baselines on the reported benchmarks, though the gains rest on a narrow set of tests.

The punchline is that i1 and its public pipeline give the field a usable open starting point that closes much of the gap to closed models on the five benchmarks they tested. That release, plus the scale of the ablations, is the part worth paying attention to.

What stands out is the systematic run of 300+ experiments on dataset mixing and text-encoder adapters, which produced a concrete recipe they then followed to train the 3B model. Equal weighting for data mixes and larger adapters are simple findings that line up with what people have seen elsewhere, but having them documented at this scale and tied to a released model is useful. The fact that everything—weights, code, data pipeline—is public removes the usual barrier that keeps most strong models closed.

The soft spot is the evaluation. All the headline numbers come from GenEval, DPG, PRISM, CVTG-2K, and LongText. Those suites could easily reward the exact data mixes and adapter choices the experiments favored, and the paper does not show human preference data, FID on held-out distributions, or adversarial prompts that would test whether the gains hold up more broadly. The abstract mentions 700K TPU hours but gives no detail on random seeds or pre-registration, so some of the reported edges might be less stable than they look.

This is for labs and researchers who need an open baseline they can actually modify and extend rather than for people chasing the absolute latest closed-model numbers. The work shows clear empirical thinking and honest engagement with the open-model gap.

I would send it to peer review. The artifact itself is worth the time even if the performance claims need tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper conducts over 300 controlled experiments (700K+ TPU hours) on modeling and data choices for text-to-image diffusion models, identifies empirical insights such as equal weighting for dataset mixing and benefits of larger text-encoder adapters, and uses these to train i1, a 3B-parameter model on only public datasets. i1 is reported competitive with leading closed models and outperforms the best prior fully open model by 29.5 absolute percentage points on average across GenEval, DPG, PRISM, CVTG-2K, and LongText; the authors release weights, code, and data pipeline.

Significance. If the empirical findings and benchmark results hold under broader scrutiny, the work supplies a practical, fully open baseline (weights, code, and pipeline) that narrows the performance gap between open and closed text-to-image systems and enables reproducible follow-on research. The explicit release of training and inference code plus the data-processing pipeline is a concrete strength that supports community verification and extension.

major comments (2)

[Abstract and evaluation description] The central performance claim (i1 competitive with closed models and +29.5 pp vs. best open model) rests exclusively on average scores across the five listed benchmarks (GenEval, DPG, PRISM, CVTG-2K, LongText). No orthogonal evaluation—human preference studies, FID on held-out distributions, or adversarial prompts outside these suites—is reported, leaving open the possibility that gains are aligned with the specific public-data mixes and design choices tested in the 300+ ablations rather than reflecting general improvements.
[Introduction / experimental methodology] The manuscript states that 300+ controlled experiments were performed, yet provides no information on whether controls were pre-registered, whether multiple random seeds were averaged, or how post-hoc selection among the many ablations was handled; this directly affects the reliability of the “empirical findings” used to justify the final i1 recipe.

minor comments (2)

Notation for the text-encoder adapter size and the precise definition of “equal weighting” for dataset mixing should be formalized with equations or pseudocode to allow exact reproduction.
[Abstract] The abstract cites “five representative benchmarks” without a brief justification of why these particular suites were chosen over alternatives (e.g., why not include standard FID or human Elo ratings).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract and evaluation description] The central performance claim (i1 competitive with closed models and +29.5 pp vs. best open model) rests exclusively on average scores across the five listed benchmarks (GenEval, DPG, PRISM, CVTG-2K, LongText). No orthogonal evaluation—human preference studies, FID on held-out distributions, or adversarial prompts outside these suites—is reported, leaving open the possibility that gains are aligned with the specific public-data mixes and design choices tested in the 300+ ablations rather than reflecting general improvements.

Authors: We acknowledge that the reported results rely on the five standard benchmarks. These suites were selected because they are the most commonly used public evaluations for text-to-image models and collectively probe prompt adherence, compositionality, and long-text handling. The manuscript does not include human preference studies, additional FID scores, or adversarial prompt sets. We will add a short limitations paragraph in the revised version noting the scope of the current evaluation and the value of future orthogonal assessments, while emphasizing that the fully open release of weights, code, and pipeline enables the community to perform such studies. revision: yes
Referee: [Introduction / experimental methodology] The manuscript states that 300+ controlled experiments were performed, yet provides no information on whether controls were pre-registered, whether multiple random seeds were averaged, or how post-hoc selection among the many ablations was handled; this directly affects the reliability of the “empirical findings” used to justify the final i1 recipe.

Authors: The experiments followed a systematic, one-factor-at-a-time design as described in the experimental sections, with each ablation varying only the targeted modeling or data choice. Pre-registration is not standard for exploratory large-scale ablation studies in this area. Key experiments averaged results over three random seeds where compute permitted; full averaging across all 300+ runs was not feasible given the 700K+ TPU-hour budget. Post-hoc selection was limited by adhering to the pre-defined experimental roadmap. We will insert a new subsection clarifying the experimental protocol, seed usage, and selection criteria to improve transparency. revision: yes
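
The seed question raised above can be made concrete with a small calculation. The sketch below averages one metric over seeds and applies a crude rule of thumb for whether a gap between two configurations exceeds seed-to-seed noise. The scores are placeholders, and the threshold `k` and the rule itself are assumptions for illustration, not the protocol described by the authors.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    return mean(scores), stdev(scores)


def gap_exceeds_noise(baseline: list[float], variant: list[float], k: float = 2.0) -> bool:
    """Heuristic check: is the gap between the two means larger than
    k times the larger of the two seed-to-seed standard deviations?"""
    base_mean, base_std = summarize(baseline)
    var_mean, var_std = summarize(variant)
    return abs(var_mean - base_mean) > k * max(base_std, var_std)


# Placeholder scores from three seeds each; NOT numbers from the paper.
baseline_runs = [70.0, 71.0, 72.0]
variant_runs = [74.0, 75.0, 76.0]

print(summarize(baseline_runs))                        # (71.0, 1.0)
print(gap_exceeds_noise(baseline_runs, variant_runs))  # True
```

With only three seeds the standard deviation estimate is itself noisy, so a check like this is a sanity filter rather than a significance test.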

Circularity Check

0 steps flagged

No circularity: empirical results from public-data training and standard benchmarks

The paper performs 300+ controlled experiments on public datasets, identifies empirical patterns such as equal weighting for data mixing, trains a 3B model, and reports scores on five external benchmarks (GenEval, DPG, PRISM, CVTG-2K, LongText). No equations, derivations, or fitted parameters are defined in terms of the target metrics; the central performance claim is a direct empirical outcome rather than a reduction to self-defined inputs or self-citation chains. Self-citations are absent from load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard diffusion-model training practices and empirical tuning whose details are not visible here.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers
cs.CV · 2026-06 · conditional · novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

Reference graph

Works this paper leans on

24 extracted references · cited by 1 Pith paper

[1] **Sequential Object Segmentation:** Describe objects one by one in a linear fashion to prevent feature bleeding, fully defining Object A before using a spatial marker to introduce Object B

[2] **Anti-Fusion Counting Logic:** When specifying a count (e.g., "four zebras"), explicitly state the number and mandate that each instance is visually distinct, identical in nature, but clearly separated from the others

[3] **Rigid Spatial Anchoring:** Define exact relative positions using positive assertions (e.g., "placed directly to the left of") rather than negative constraints, locking each object to a specific geometric coordinate in the frame

[4] **Negative Space Enforcement:** Explicitly force clear gaps and physical distance between objects using phrases like "separated by a clear gap" or "standing distinctly apart" to ensure object detection mechanisms can isolate them

[5] **Attribute Binding:** Tightly bind adjectives such as color, shape, and material directly to their specific nouns immediately in the sentence to avoid cross-contamination of colors or textures between distinct objects

[6] **Sensory Feature Hallucination:** Expand short, generic inputs by hallucinating rich physical textures, specific materials, and defined lighting setups to elevate the prompt length to the optimal 75-150 word range

[7] **Short-Text Typography:** For short quoted text, explicitly instruct the model to render it using terms like "bold, standard font," "highly legible typography," and "flat, undistorted lettering."

[8] **Text-Background Contrast:** Enforce high optical contrast for any rendered text by specifying that the text color sharply contrasts with the solid, plain surface it is written on, ensuring high detectability

[9] **Background Neutrality:** Keep the overarching background uncluttered, neutral, or out-of-focus to maximize the visual saliency of the primary subjects and any textual elements

[10] The Principle of Priority & Refinement

**Photographic Standardization:** Ground the scene in realistic studio or natural lighting with sharp focus (unless an art style is specified), which enhances the geometric clarity of the objects and text. ### **Mode B: Narrative-Dense Input (Descriptive / Artistic / Long Text)** *Trigger:* Input is descriptive, structurally dense (typica...

[11] **Subject-Context Front-Loading:** Move the primary subject, main action, and the most critical textual elements to the absolute start of the prompt so they receive the highest attention weights

[12] **Exact Text Transcription:** For long text blocks or paragraphs, transcribe the quoted content exactly as provided without altering a single character, word, or punctuation mark

[13] **Long-Text Formatting:** Describe the structural layout of long text using precise terms like "centered block of text," "neatly aligned lines," "clear margins," or "bullet points" to maintain structural integrity

[14] **Enhanced Legibility Modifiers:** Boost the detectability of long, dense text by mandating "crisp, high-contrast, uniform lettering," "even lighting across the text surface," and "zero distortion or overlapping strokes."

[15] **Non-Essential Detail Trimming:** When the text to be rendered is extremely long, compress or strip away overly complex background or environmental descriptions to avoid capacity overload, ensuring the text remains the absolute visual priority

[16] **Syntactic Decomplexing:** Break long, winding narrative sentences into punchy, independent, active-voice statements, forcing the model to render one visual concept fully before calculating the next

[17] **Sensory Sharpening:** Translate vague, abstract, or emotional concepts into concrete, renderable physical properties, replacing poetic language with specific lighting, color, and texture instructions

[18] **Anti-Hallucination Grounding:** Explicitly define the boundaries of the scene and do not introduce unprompted objects or extraneous elements that were not implied by the dense input graph

[19] **Semantic Coverage Preservation:** Ensure that every distinct noun, verb, and requested attribute from the original dense input is accounted for and translated into the final rewritten output

[20] **Cohesive Stylistic Binding:** Reiterate the requested art style, medium, or global atmospheric lighting at the very end of the prompt to bind all the dense, disparate elements into a single cohesive image. --- **General Rewrite Rules:**

[21] **Length Strategy:** Target a final output length strictly between 75-150 words

[22] **Tone:** Objective, descriptive, and visually grounded

[23] is painted on this board in bold, crisp white letters, ensuring maximum contrast and legibility. To the right, separated by a clear gap, is a solid wooden bench. The word

**Output Format:** Output exclusively the final rewritten prompt string. Do not output classification labels, reasoning, or conversational filler. --- **Few-Shot Examples:** **Input (Mode A - Spatial / Shape):** A triangular sign and a small sculpture **Output:** A triangular metal road sign stands firmly on the left side of the frame. ...

[24] Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:

and Qwen-Image (Wu et al., 2025a). Overall, FLUX.2 achieves the most balanced performance across all benchmarks. Likely due to its alignment with pre-trained semantic features during training, VA-VAE achieves the strongest performance on DPG-Bench, which emphasizes semantic alignment of generated images with prompts. However, it performs much worse than t...

arXiv 2025

[1] [1]

** Sequential Object Segmentation :** Describe objects one by one in a linear fashion to prevent feature bleeding , fully defining Object A before using a spatial marker to introduce Object B

[2] [2]

four zebras

** Anti - Fusion Counting Logic :** When specifying a count ( e . g . , " four zebras ") , explicitly state the number and mandate that each instance is visually distinct , identical in nature , but clearly separated from the others

[3] [3]

placed directly to the left of

** Rigid Spatial Anchoring :** Define exact relative positions using positive assertions ( e . g . , " placed directly to the left of ") rather than negative constraints , locking each object to a specific geometric coordinate in the frame

[4] [4]

separated by a clear gap

** Negative Space Enforcement :** Explicitly force clear gaps and physical distance between objects using phrases like " separated by a clear gap " or " standing distinctly apart " to ensure object detection mechanisms can isolate them

[5] [5]

** Attribute Binding :** Tightly bind adjectives such as color , shape , and material directly to their specific nouns immediately in the sentence to avoid cross - contamination of colors or textures between distinct objects

[6]

**Sensory Feature Hallucination:** Expand short, generic inputs by hallucinating rich physical textures, specific materials, and defined lighting setups to elevate the prompt length to the optimal 75-150 word range.

[7]

**Short-Text Typography:** For short quoted text, explicitly instruct the model to render it using terms like "bold, standard font," "highly legible typography," and "flat, undistorted lettering."

[8]

**Text-Background Contrast:** Enforce high optical contrast for any rendered text by specifying that the text color sharply contrasts with the solid, plain surface it is written on, ensuring high detectability.

[9]

**Background Neutrality:** Keep the overarching background uncluttered, neutral, or out-of-focus to maximize the visual saliency of the primary subjects and any textual elements.

[10]

The Principle of Priority & Refinement

**Photographic Standardization:** Ground the scene in realistic studio or natural lighting with sharp focus (unless an art style is specified), which enhances the geometric clarity of the objects and text.

### **Mode B: Narrative-Dense Input (Descriptive/Artistic/Long Text)**

*Trigger:* Input is descriptive, structurally dense (typica...

[11]

**Subject-Context Front-Loading:** Move the primary subject, main action, and the most critical textual elements to the absolute start of the prompt so they receive the highest attention weights.

[12]

**Exact Text Transcription:** For long text blocks or paragraphs, transcribe the quoted content exactly as provided without altering a single character, word, or punctuation mark.

[13]

**Long-Text Formatting:** Describe the structural layout of long text using precise terms like "centered block of text," "neatly aligned lines," "clear margins," or "bullet points" to maintain structural integrity.

[14]

**Enhanced Legibility Modifiers:** Boost the detectability of long, dense text by mandating "crisp, high-contrast, uniform lettering," "even lighting across the text surface," and "zero distortion or overlapping strokes."

[15]

**Non-Essential Detail Trimming:** When the text to be rendered is extremely long, compress or strip away overly complex background or environmental descriptions to avoid capacity overload, ensuring the text remains the absolute visual priority.

[16]

**Syntactic Decomplexing:** Break long, winding narrative sentences into punchy, independent, active-voice statements, forcing the model to render one visual concept fully before calculating the next.

[17]

**Sensory Sharpening:** Translate vague, abstract, or emotional concepts into concrete, renderable physical properties, replacing poetic language with specific lighting, color, and texture instructions.

[18]

**Anti-Hallucination Grounding:** Explicitly define the boundaries of the scene and do not introduce unprompted objects or extraneous elements that were not implied by the dense input graph.

[19]

**Semantic Coverage Preservation:** Ensure that every distinct noun, verb, and requested attribute from the original dense input is accounted for and translated into the final rewritten output.

[20]

**Cohesive Stylistic Binding:** Reiterate the requested art style, medium, or global atmospheric lighting at the very end of the prompt to bind all the dense, disparate elements into a single cohesive image.

---

**General Rewrite Rules:**

[21]

**Length Strategy:** Target a final output length strictly between 75-150 words.

[22]

**Tone:** Objective, descriptive, and visually grounded.

[23]

is painted on this board in bold, crisp white letters, ensuring maximum contrast and legibility. To the right, separated by a clear gap, is a solid wooden bench. The word

**Output Format:** Output exclusively the final rewritten prompt string. Do not output classification labels, reasoning, or conversational filler.

---

**Few-Shot Examples:**

**Input (Mode A - Spatial/Shape):** A triangular sign and a small sculpture

**Output:** A triangular metal road sign stands firmly on the left side of the frame. ...
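Two of the rules quoted above are mechanical enough to check automatically: the 75-150 word length target and the requirement that quoted text be transcribed exactly. The following is a minimal sketch of such a check, not part of the paper's pipeline. The names `check_rewrite` and `quoted_spans` are hypothetical, words are counted by whitespace splitting, the length bounds are treated as inclusive, and the verbatim check is applied to every double-quoted span rather than only to long text blocks; all of these are assumptions.

```python
import re

# Bounds taken from the Length Strategy rule; inclusive handling is an assumption.
MIN_WORDS, MAX_WORDS = 75, 150


def quoted_spans(text: str) -> list[str]:
    """Return the contents of every double-quoted span in `text`."""
    return re.findall(r'"([^"]+)"', text)


def check_rewrite(original: str, rewritten: str) -> list[str]:
    """Return a list of rule violations for a rewritten prompt (empty if none)."""
    problems = []

    # Length rule: count whitespace-separated tokens in the rewritten prompt.
    n_words = len(rewritten.split())
    if not MIN_WORDS <= n_words <= MAX_WORDS:
        problems.append(f"length {n_words} outside {MIN_WORDS}-{MAX_WORDS} words")

    # Exact transcription rule: each quoted span must survive character for character.
    for span in quoted_spans(original):
        if span not in rewritten:
            problems.append(f"quoted text not preserved verbatim: {span!r}")

    return problems
```

A check like this could gate the output of the rewriting model and trigger a retry when the returned list is non-empty; the stylistic rules (spatial anchoring, attribute binding, tone) are not covered and would need human or model-based review.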

1957

[24]

Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:

and Qwen-Image (Wu et al., 2025a). Overall, FLUX.2 achieves the most balanced performance across all benchmarks. Likely due to its alignment with pre-trained semantic features during training, VA-VAE achieves the strongest performance on DPG-Bench, which emphasizes semantic alignment of generated images with prompts. However, it performs much worse than t...

arXiv 2025