arxiv: 2605.25347 · v1 · pith:PQRWH7PXnew · submitted 2026-05-25 · 💻 cs.CV · cs.LG

ERNIE-Image Technical Report

Pith reviewed 2026-06-29 22:52 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-image generation · diffusion transformer · data curation pipeline · direct preference optimization · aesthetic assessment · instruction following · open-source model · post-training alignment

The pith

An 8B single-stream DiT text-to-image model closes much of the gap to commercial systems by using bottom-up data pipelines and stabilized DPO.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ERNIE-Image as an open-source text-to-image model on an 8B DiT backbone. It claims that a bottom-up pre-training pipeline of fine-grained categorization, rich captioning, aesthetic assessment, and hierarchical sampling reduces noise while retaining long-tail concepts, and that a top-down post-training pipeline with diversified prompts and stabilized DPO better matches real user inputs and human preferences. The authors also release an efficient turbo variant, a prompt enhancer, and new aesthetic evaluation tools. If the performance gains hold, this approach shows how data quality and alignment can make open-source models competitive with closed-source ones in instruction following, text rendering, and aesthetics without larger model sizes.

Core claim

ERNIE-Image is built on an 8B single-stream DiT architecture. During pre-training, a bottom-up data construction pipeline combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling to reduce data noise while preserving long-tail concepts. In post-training, a top-down pipeline diversifies prompt annotations and applies stabilized DPO to align outputs with human aesthetic preferences. The model is accompanied by ERNIE-Image-Turbo for 8-NFE generation, distilled with MT-DMD to limit capability drift, a lightweight Prompt Enhancer, and ERNIE-Image-Aes together with the ERNIE-Image-Aes-1K benchmark. Experiments indicate the resulting model leads open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality.

What carries the argument

The bottom-up data construction pipeline (fine-grained categorization, rich captioning, aesthetic assessment, hierarchical sampling) paired with stabilized DPO in post-training.
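
For orientation, the pairwise objective that "stabilized DPO" presumably builds on can be sketched as below. This is the standard DPO loss from Rafailov et al., not the paper's stabilized variant, whose details are not given in this summary. The function name `dpo_loss`, the `beta` default, and the argument names are illustrative assumptions. For diffusion models the log-likelihoods are usually replaced by denoising-error surrogates, which this sketch does not attempt.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard pairwise DPO loss for one (preferred, rejected) pair.

    All arguments are log-likelihoods of the two samples under the policy
    being trained and under a frozen reference model (assumed inputs).
    """
    # Implicit reward margin: how much more the policy favours the preferred
    # sample over the rejected one, relative to the reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference agree on both samples the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred sample.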

If this is right

  • Open-source text-to-image models can approach commercial performance levels through data curation instead of model scaling.
  • Hierarchical sampling preserves long-tail concepts that standard random sampling would discard.
  • Stabilized DPO provides a practical route to align generation outputs with human aesthetic judgments after pre-training.
  • An 8-NFE turbo variant can retain most quality while cutting inference cost when paired with drift mitigation.
  • Dedicated aesthetic models and human-annotated benchmarks enable more reliable comparison than existing proxies.
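
The long-tail point above can be illustrated with a two-level sampler that first picks a category from a flattened distribution and then picks an item inside it. This is a minimal sketch under stated assumptions: the paper's actual hierarchy and weights are not described here, and the `alpha` exponent, the function names, and the toy categories are hypothetical.

```python
import random
from collections import Counter

def category_weights(bucket_sizes, alpha=0.5):
    """Tempered category probabilities: p(c) proportional to size(c) ** alpha.

    alpha=1 matches uniform sampling over items; alpha=0 gives every
    non-empty category the same probability regardless of size.
    """
    raw = {c: n ** alpha for c, n in bucket_sizes.items() if n > 0}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

def hierarchical_sample(buckets, k, alpha=0.5, seed=0):
    """Draw k items with replacement: pick a category, then an item in it."""
    rng = random.Random(seed)
    probs = category_weights({c: len(v) for c, v in buckets.items()}, alpha)
    cats = list(probs)
    weights = [probs[c] for c in cats]
    picked = rng.choices(cats, weights=weights, k=k)
    return [(c, rng.choice(buckets[c])) for c in picked]

# Toy corpus (hypothetical): one head category and one long-tail category.
buckets = {"portrait": list(range(9900)), "astrolabe": list(range(100))}
print(category_weights({c: len(v) for c, v in buckets.items()}))
print(Counter(c for c, _ in hierarchical_sample(buckets, k=1000)))
```

With `alpha=0.5` the tail category's sampling probability rises from 1% (its share of items) to roughly 9%, which is the kind of rebalancing the bullet describes.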

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bottom-up curation pattern could transfer to video or audio generation to reduce reliance on proprietary data.
  • Releasing the aesthetic benchmark may encourage standardized evaluation across future open models.
  • Prompt enhancers of this type could become standard tooling for turning short user intents into reliable generation inputs.
  • If the gains prove robust, similar staged pipelines might reduce the need for ever-larger base models in other generative domains.

Load-bearing premise

The claimed gains in instruction following, text rendering, and aesthetic quality come from the described data pipelines and DPO rather than from evaluation differences or data overlap with test sets.

What would settle it

An independent test on a fresh prompt set that measures instruction adherence, text rendering accuracy, and aesthetic scores and finds no measurable edge over other open-source 8B models.
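
One way to make "no measurable edge" concrete is a paired bootstrap over prompts: score both models on the same fresh prompts and check whether the confidence interval for the mean difference excludes zero. The sketch below is illustrative only. The function name, the 95% level, and the resample count are assumptions, not the paper's evaluation protocol.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) on shared prompts.

    Returns (observed mean difference, lower bound, upper bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample prompts with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return sum(diffs) / n, lo, hi
```

If the interval contains zero, the test found no measurable edge at that level on that metric; an interval entirely above zero would support the claimed gain.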

Figures

Figures reproduced from arXiv: 2605.25347 by Anqi Chen, Changling Liu, Chao Han, Haoxin Zhang, Honglin Xiong, Huanai Wang, Jiakang Hu, Jianwen Yang, Jiaxiang Liu, Jinghui Duan, Jun Xia, Jun Zhang, Lin Gao, Nan Sheng, Pengyu Zou, Qian Zhang, Qiao Zhao, Qingli Kong, Qi Zhou, Quanwen Zhang, Ranjun Hua, Siqi Wang, Siyang Sun, Tianrui Zhu, Tianyu Li, Tiechao He, Xiang Zhang, Xiaolong Ma, Xiaowen Yang, Xinmin Zhang, Xueming Jiang, Xuguang Liu, Yang Wan, Yang Wu, Yan Pan, Yanzheng Lin, Yaxin Liu, Yehua Yang, Yi Liu, Yiran Ren, Yixiang Tu, Youzhi Yang, Yuehu Dong, Yunlin Liu, Yunpeng Ding, Yu Sun, Yuting Lei, Zhenyu Qian, Zhida Feng.

Figure 1: Showcases generated by ERNIE-Image.
Figure 2: Showcases generated by ERNIE-Image.
Figure 3: Screenshot from the annotation interface. Above the 2 images, there is a single line of text to ...
Figure 4: Preview of aesthetic annotation results.
Figure 5: Each bar plot shows the probability density of scores predicted by the corresponding aesthetic ...
Figure 6: Qualitative comparison of prompt enhancement on three representative tasks. From left to ...
Figure 7: Comparison between ERNIE-Image and state-of-the-art open-source and closed-source models.
Figure 8: Comparison between ERNIE-Image and state-of-the-art open-source and closed-source models.
Figure 9: Comparison between ERNIE-Image and state-of-the-art open-source and closed-source models.
Figure 10: Comparison between ERNIE-Image and state-of-the-art open-source and closed-source models.
Figure 11: Comparison between ERNIE-Image and state-of-the-art open-source and closed-source models.

Original abstract

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ERNIE-Image, an 8B-parameter single-stream DiT text-to-image model. It describes a bottom-up pre-training data pipeline (fine-grained categorization, rich captioning, aesthetic assessment, hierarchical sampling) intended to reduce noise while retaining long-tail concepts, a top-down post-training pipeline with diversified prompts and stabilized DPO for human preference alignment, distillation to ERNIE-Image-Turbo using MT-DMD to limit capability drift, a lightweight Prompt Enhancer, and the auxiliary ERNIE-Image-Aes model plus ERNIE-Image-Aes-1K human-annotated benchmark. The central claim is that extensive qualitative and quantitative experiments demonstrate leading performance among open-source models and near-parity with top commercial systems on instruction following, text rendering, and aesthetic quality.

Significance. If the performance claims are substantiated with proper controls, the work would offer concrete, reproducible details on data-curation and alignment techniques that could help close the open-to-closed-source gap in text-to-image generation; the public release of models and aesthetic resources would constitute a direct community benefit.

major comments (2)
  1. [Abstract] The central claim of 'leading performance among open-source models' and 'approaches top-tier commercial models' in instruction following, text rendering, and aesthetic quality is asserted without any quantitative tables, baselines, metrics, error bars, or dataset statistics, rendering the claim impossible to evaluate from the provided text.
  2. [Pre-training and post-training sections] The attribution of gains specifically to the bottom-up categorization/captioning/aesthetic/hierarchical-sampling pipeline plus stabilized DPO lacks isolating ablations or matched-data controls that would rule out confounds such as scale differences, data overlap, or benchmark construction variations.
minor comments (1)
  1. [Abstract] 'MT-DMD' is introduced without expansion or citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The abstract summarizes high-level claims whose supporting quantitative evidence appears in the Experiments section; we address both major comments below and outline targeted revisions.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'leading performance among open-source models' and 'approaches top-tier commercial models' in instruction following, text rendering, and aesthetic quality is asserted without any quantitative tables, baselines, metrics, error bars, or dataset statistics, rendering the claim impossible to evaluate from the provided text.

    Authors: The abstract is a concise summary; the full manuscript contains a dedicated Experiments section (Section 4) with quantitative tables reporting metrics (CLIP-T, FID, OCR accuracy, human preference rates), baselines (SDXL, PixArt-α, SD3, commercial APIs), error bars from repeated evaluations, and dataset statistics for the pre-training and post-training corpora. We will revise the abstract to explicitly cross-reference these results and include one or two key headline numbers if space permits.

    revision: partial

  2. Referee: [Pre-training and post-training sections] The attribution of gains specifically to the bottom-up categorization/captioning/aesthetic/hierarchical-sampling pipeline plus stabilized DPO lacks isolating ablations or matched-data controls that would rule out confounds such as scale differences, data overlap, or benchmark construction variations.

    Authors: We acknowledge that fully isolating every pipeline component would strengthen causal attribution. The current manuscript reports performance against models trained on public datasets at comparable scale and includes partial controls (e.g., ablation of the aesthetic filter and hierarchical sampling on a 1B-scale proxy). Comprehensive matched-data ablations at 8B scale are computationally prohibitive; we will add an expanded discussion of potential confounds, data-overlap checks, and the available partial ablations in the revision.

    revision: yes
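
A data-overlap check of the kind mentioned in this response could start with simple n-gram containment between benchmark prompts and training captions. The sketch below is an assumption about what such a check might look like, not the authors' procedure. The function names, the trigram size, and the 0.5 threshold are hypothetical, and a real audit would also need image-level near-duplicate detection.

```python
import re

def word_ngrams(text, n=3):
    """Lower-cased word n-grams of a caption or prompt, as a set."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlaps(benchmark_prompts, training_captions, n=3, threshold=0.5):
    """Return (prompt, containment) pairs that largely reappear in training text.

    Containment is the fraction of a prompt's n-grams found anywhere in the
    training captions; the threshold is an arbitrary illustrative value.
    """
    train_grams = set()
    for caption in training_captions:
        train_grams |= word_ngrams(caption, n)
    flagged = []
    for prompt in benchmark_prompts:
        grams = word_ngrams(prompt, n)
        if not grams:
            continue  # prompt shorter than n words: nothing to compare
        containment = len(grams & train_grams) / len(grams)
        if containment >= threshold:
            flagged.append((prompt, containment))
    return flagged
```

Prompts flagged this way would be candidates for removal from the evaluation set or for separate reporting, so that headline scores are not inflated by memorized captions.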

Circularity Check

0 steps flagged

No circularity: empirical technical report with no derivations or equations

full rationale

The manuscript is a standard technical report describing an 8B DiT model, bottom-up/top-down data pipelines, stabilized DPO, distillation, and empirical results. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential mathematical steps appear. Performance claims rest on experiments rather than any chain that reduces to inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. This is the expected non-finding for an engineering report without theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no mathematical derivations, fitted constants, or new postulated entities; the ledger is therefore empty.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Qwen-Image-Flash: Beyond Objective Design

    cs.CV · 2026-06 · unverdicted · novelty 4.0

    Empirical analysis of data, guidance, and task mixture in few-step distillation of Qwen-Image-2.0 produces the Qwen-Image-Flash model with improved performance in unified generation and editing tasks.
