pith. sign in

arxiv: 2605.21090 · v1 · pith:7SW5K7XMnew · submitted 2026-05-20 · 💻 cs.CV

TextSculptor: Training and Benchmarking Scene Text Editing

Pith reviewed 2026-05-21 05:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text editingdata construction pipelineimage editing benchmarktext-to-image synthesisOCR verificationmultimodal evaluationpaired editing samplesbackground preservation
0
0 comments X

The pith

An automated pipeline generates 3.2 million paired samples and a four-task benchmark that raise open-source scene text editing performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that scene text editing can be improved by systematically generating large volumes of aligned training data instead of relying on scarce real examples. Current open-source models fall short on precise text changes because they lack sufficient examples that keep backgrounds unchanged while altering only the text. The authors build a pipeline that synthesizes base images, renders new text programmatically, and composites them to create source-target pairs with strong consistency. They pair this with a benchmark that scores text accuracy via OCR, visual quality via multimodal checks, and background fidelity via region similarity. A reader would care because reliable open tools for editing text in photos would reduce dependence on closed proprietary systems for design, advertising, and accessibility work.

Core claim

The central claim is that TextSculptor, built around an automated data construction pipeline and the TextSculpt-Bench, produces 1.2 million OCR-verified text-to-image samples plus 2 million editing pairs that train models to perform better on text addition, replacement, removal, and hybrid tasks while preserving visual realism and non-target areas, thereby narrowing the performance difference with proprietary systems.

What carries the argument

The automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing to create naturally aligned source-target image pairs with strong background consistency.

If this is right

  • Models trained on the dataset achieve higher text accuracy on the four benchmark tasks as measured by OCR alignment.
  • Visual quality of edited regions improves while non-target areas remain closer to the original.
  • Background preservation scores rise through direct region similarity comparisons.
  • Standardized evaluation across addition, replacement, removal, and hybrid edits becomes possible for any new method.
  • The performance gap between open-source and proprietary text editing systems shrinks under the same multimodal judgment protocol.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar automated synthesis pipelines could be adapted to generate training data for other narrow image-editing domains such as object insertion or lighting changes.
  • The existence of a public benchmark with clear metrics may push more research groups to release open models rather than keeping improvements private.
  • If the data construction method scales, future work could test whether the same pairs also improve video-frame text editing consistency.
  • The approach highlights that background consistency is a separable requirement from text accuracy, which may guide loss design in other generative models.

Load-bearing premise

The synthetic samples generated by the pipeline match the distribution and quality of real-world editing requests closely enough that models trained on them improve on actual photographs without new artifacts or biases.

What would settle it

Training an open model on the 3.2 million pairs and then measuring no gain or a drop in OCR text accuracy and background similarity scores on a fresh collection of real scene photographs with natural text edits.

Figures

Figures reproduced from arXiv: 2605.21090 by Fei Yu, Heyun Chen, Jinghuan Chen, Moran Li, Qi She, Siyu Jiao, Wei Zhou, Xiaohan Lan, Yao Zhao, Yiheng Lin, Yingchen Yu, Yujie Zhong, Yunchao Wei, Zhengwei Wang, Zijian Feng.

Figure 1
Figure 1. Figure 1: Illustration of TextSculpt-Data and TextSculpt-Bench. TextSculpt-Data contains text-to [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of our automated data construction pipelines. The top stream generates high [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative evaluation example on TextSculpt-Bench. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces TextSculptor, a framework for scene text editing that includes an automated data construction pipeline combining text-aware image synthesis with programmatic text rendering and compositing. This produces TextSculpt-Data (3.2M samples: 1.2M OCR-verified text-to-image and 2M paired editing samples with aligned source-target pairs and background consistency) and TextSculpt-Bench (covering text addition, replacement, removal, and hybrid editing). Evaluation uses OCR-based text alignment, multimodal judgment, and background-region similarity. Experiments claim that models trained on this data improve open-source text editing performance and narrow the gap to proprietary models.

Significance. If the synthetic data transfers to real scenes without introducing exploitable artifacts, the large-scale dataset and tailored benchmark could meaningfully advance open-source scene text editing by addressing data scarcity. The automated pipeline for generating aligned editing pairs and the multi-metric evaluation protocol are constructive contributions that could support reproducible progress in this sub-area.

major comments (1)
  1. [§4 and §5] §4 (TextSculpt-Bench construction) and §5 (experiments): both the training pairs in TextSculpt-Data and the benchmark images are generated by the identical automated pipeline (text-aware synthesis + programmatic rendering/compositing). This creates a risk that reported gains exploit pipeline-specific regularities (e.g., consistent lighting, font compositing artifacts, background alignment statistics) rather than demonstrating generalization. The central performance claim would be strengthened by quantitative evaluation on a held-out real-scene test set or by reporting distribution-shift metrics (e.g., feature-space divergence) between synthetic and natural images.
minor comments (2)
  1. [Abstract] Abstract: quantitative metrics, ablation results, and error analysis are referenced but not reported; including at least headline numbers (e.g., OCR accuracy deltas, background similarity scores) would improve clarity.
  2. [Data and benchmark sections] The GitHub link is provided but no details on data release format, licensing, or exact train/val/test splits are given in the text; adding these would aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The concern about potential overfitting to pipeline-specific artifacts is well-taken and directly relevant to the validity of our generalization claims. We address it point by point below.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (TextSculpt-Bench construction) and §5 (experiments): both the training pairs in TextSculpt-Data and the benchmark images are generated by the identical automated pipeline (text-aware synthesis + programmatic rendering/compositing). This creates a risk that reported gains exploit pipeline-specific regularities (e.g., consistent lighting, font compositing artifacts, background alignment statistics) rather than demonstrating generalization. The central performance claim would be strengthened by quantitative evaluation on a held-out real-scene test set or by reporting distribution-shift metrics (e.g., feature-space divergence) between synthetic and natural images.

    Authors: We agree that using the same automated pipeline for both TextSculpt-Data and TextSculpt-Bench introduces a risk that performance gains may partly reflect exploitation of synthetic regularities rather than true generalization to natural scenes. While the pipeline incorporates diverse real-world elements (varied backgrounds, lighting, and fonts drawn from public sources), this does not fully eliminate the domain-gap concern. In the revised manuscript we will (1) add quantitative distribution-shift analysis (e.g., Fréchet Inception Distance and feature-space divergence computed with a pre-trained CLIP vision encoder) between our synthetic images and a collection of real scene-text images, and (2) report results on at least one held-out real-world scene-text editing test set drawn from existing public datasets. These additions will be placed in §5 and will be accompanied by a limitations paragraph acknowledging remaining synthetic-to-real gaps. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data pipeline and benchmark with no derivation chain

full rationale

The paper describes an automated synthetic data construction pipeline (text-aware synthesis + programmatic rendering/compositing) to produce TextSculpt-Data (3.2M samples) and TextSculpt-Bench (four editing tasks with OCR/multimodal metrics). Performance claims rest on training models on this data and reporting empirical gains versus baselines. No equations, fitted parameters, or mathematical derivations are present that could reduce to inputs by construction. The shared generative process between train and test sets raises a legitimate generalization question (synthetic-to-real transfer), but this is a correctness or distribution-shift issue, not a circular reduction of any claimed derivation. The work is self-contained as an engineering contribution with explicit synthetic data and custom benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central contribution is an empirical data pipeline whose internal hyperparameters and synthesis assumptions are not detailed here.

pith-pipeline@v0.9.0 · 5853 in / 1088 out tokens · 58198 ms · 2026-05-21T05:13:35.547999+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023

  3. [3]

    Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353– 9387, 2023

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters.Advances in Neural Information Processing Systems, 36:9353– 9387, 2023

  4. [4]

    Paddleocr 3.0 technical report, 2025

    Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025

  5. [5]

    Emu3.5: Native Multimodal Models are World Learners

    Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, et al. Emu3.5: Native multimodal models are world learners.arXiv preprint arXiv:2510.26583, 2025

  6. [6]

    Nano banana pro - gemini ai image generator & photo editor

    Google DeepMind. Nano banana pro - gemini ai image generator & photo editor. https: //gemini.google/overview/image-generation/

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  8. [8]

    Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

    Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

  9. [9]

    Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092–27112, 2023

  10. [10]

    Seed-data-edit technical report: A hybrid dataset for instructional image editing, 2024

    Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing, 2024

  11. [11]

    X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.CoRR, abs/2507.22058, 2025

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

  12. [12]

    Texteditbench: Evaluating reasoning-aware text editing beyond rendering.arXiv preprint arXiv:2512.16270, 2025

    Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, and Alex Jinpeng Wang. Texteditbench: Evaluating reasoning-aware text editing beyond rendering.arXiv preprint arXiv:2512.16270, 2025

  13. [13]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

  14. [14]

    Hq-edit: A high-quality dataset for instruction-based image editing, 2024

    Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing, 2024

  15. [15]

    Thinkgen: Generalized thinking for visual generation.arXiv preprint arXiv:2512.23568, 2025

    Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, et al. Thinkgen: Generalized thinking for visual generation.arXiv preprint arXiv:2512.23568, 2025

  16. [16]

    Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024

    Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, and Lingyu Duan. Densefusion-1m: Merging vision experts for comprehensive multimodal perception.Advances in Neural Information Processing Systems, 37:18535–18556, 2024. 10

  17. [17]

    Step1X-Edit: A Practical Framework for General Image Editing

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761, 2025

  18. [18]

    Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/

    OpenAI. Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/

  19. [19]

    Introducing 4o image generation, 2025

    OpenAI. Introducing 4o image generation, 2025

  20. [20]

    Pico-banana-400k: A large-scale dataset for text-guided image editing, 2025

    Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, and Zhe Gan. Pico-banana-400k: A large-scale dataset for text-guided image editing.arXiv preprint arXiv:2510.19808, 2025

  21. [21]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  22. [22]

    LongCat-Image Technical Report

    Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, and Jie Hu. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

  23. [23]

    Firered-image-edit-1.0 techinical report.arXiv preprint arXiv:2602.13344, 2026

    Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, et al. FireRed-Image-Edit-1.0 Technical Report. arXiv preprint arXiv:2602.13344, 2026

  24. [24]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Z-Image Team. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint arXiv:2511.22699, 2025

  25. [25]

    Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

    Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing.arXiv preprint arXiv:2311.03054, 2023

  26. [26]

    Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025

    Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, et al. Textatlas5m: A large-scale dataset for dense text image generation.arXiv preprint arXiv:2502.07870, 2025

  27. [27]

    Instructedit: Improving automatic masks for diffusion-based image editing with user instructions, 2023

    Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. Instructedit: Improving automatic masks for diffusion-based image editing with user instructions, 2023

  28. [28]

    Gpt-image-edit-1.5 m: A million-scale, gpt-generated image dataset.arXiv preprint arXiv:2507.21033, 2025

    Yuhan Wang, Siwei Yang, Bingchen Zhao, Letian Zhang, Qing Liu, Yuyin Zhou, and Cihang Xie. Gpt-image-edit-1.5 m: A million-scale, gpt-generated image dataset.arXiv preprint arXiv:2507.21033, 2025

  29. [29]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image Technical Report.arXiv preprint arXiv:2508.02324, 2025

  30. [30]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  31. [31]

    Omnigen: Unified image generation, 2024

    Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. Omnigen: Unified image generation, 2024

  32. [32]

    Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation.arXiv preprint arXiv:2508.09987, 2025

    Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, et al. Echo-4o: Harnessing the power of gpt-4o synthetic images for improved image generation.arXiv preprint arXiv:2508.09987, 2025

  33. [33]

    ImgEdit: A Unified Image Editing Dataset and Benchmark

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

  34. [34]

    Anyedit: Mastering unified high-quality image editing for any idea

    Qifan Yu, Wei Chow, Zhongqi Yue, Kaihang Pan, Yang Wu, Xiaoyang Wan, Juncheng Li, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Anyedit: Mastering unified high-quality image editing for any idea. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26125–26135, 2025. 11

  35. [35]

    Weedit: A dataset, benchmark and glyph-guided framework for text-centric image editing

    Hui Zhang, Juntao Liu, Zongkai Liu, Liqiang Niu, Fandong Meng, Zuxuan Wu, and Yu-Gang Jiang. Weedit: A dataset, benchmark and glyph-guided framework for text-centric image editing. arXiv preprint arXiv:2603.11593, 2026

  36. [36]

    Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024

  37. [37]

    Ultraedit: Instruction-based fine-grained image editing at scale, 2024

    Haozhe Zhao, Xiaojian Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale, 2024

  38. [38]

    Lex-art: Rethinking text generation via scalable high- quality data synthesis.arXiv preprint arXiv:2503.21749, 2025

    Shitian Zhao, Qilong Wu, Xinyue Li, Bo Zhang, Ming Li, Qi Qin, Dongyang Liu, Kaipeng Zhang, Hongsheng Li, Yu Qiao, et al. Lex-art: Rethinking text generation via scalable high- quality data synthesis.arXiv preprint arXiv:2503.21749, 2025

  39. [39]

    Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing (risebench), 2025

    Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Hao Li, Zicheng Zhang, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, and Haodong Duan. Envisioning beyond the pixels: Benchmarking reasoning-informed visual editing (risebench), 2025. 12