pith. machine review for the scientific record.

arxiv: 2605.14708 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text generation · style transfer · multilingual text · image synthesis · style consistency · computer vision

The pith

StyleTextGen generates scene text that matches reference visual styles across languages using a dedicated dual-branch encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StyleTextGen as a way to create text inside images that copies the exact visual appearance of reference text, including for scripts in different languages. Current approaches often fail to pull clean styles out of busy backgrounds or keep the style uniform across every character in a word. The new system adds a dual-branch encoder focused only on style, a loss term that forces style consistency, and a mask-based step at inference time to lock the output style to the input reference. These pieces together produce better results on both single-language and mixed-language cases than earlier methods. The work also releases a bilingual benchmark to measure such performance directly.
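
The abstract does not spell out how the mask-based inference step operates, so the sketch below shows one common way such a strategy can be realized: an inpainting-style denoising loop in which the region outside the target-text mask is held to the re-noised reference latents at every step. The function names, arguments, and the blending rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def mask_guided_denoise(latents, ref_latents, mask, denoise_step, add_noise, timesteps):
    """Hypothetical mask-guided inference loop (a sketch, not StyleTextGen's code).

    latents:      (B, C, H, W) current noisy latents for the image being generated
    ref_latents:  (B, C, H, W) clean latents of the style reference image
    mask:         (B, 1, H, W) 1 inside the region where new text is generated, 0 elsewhere
    denoise_step: callable(latents, t) -> latents one noise level lower (the diffusion model)
    add_noise:    callable(clean_latents, t) -> latents re-noised to level t
    timesteps:    iterable of timesteps from high noise to low noise
    """
    for t in timesteps:
        # Bring the reference to the current noise level so the two regions are compatible.
        noised_ref = add_noise(ref_latents, t)
        # Re-impose the reference context everywhere outside the text mask.
        latents = torch.where(mask.bool(), latents, noised_ref)
        # Model update; the next iteration's blend again locks the unmasked region to the reference.
        latents = denoise_step(latents, t)
    return latents
```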

Core claim

StyleTextGen learns to perceive and replicate visual text styles across different languages and writing systems by introducing a dual-branch style encoder that yields robust multilingual representations from complex scenes, a text style consistency loss that improves coherence and visual quality, and a mask-guided inference strategy that ensures precise alignment, resulting in superior style consistency and cross-lingual generalization over prior methods.
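
The consistency loss is only named in the abstract. As a point of reference, the sketch below shows one plausible form: matching second-order feature statistics (Gram matrices) of the generated and reference text regions, in the spirit of classical style-transfer losses. The use of Gram matrices and of a frozen feature extractor are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    # feats: (B, C, H, W) feature map; returns (B, C, C) normalized channel correlations.
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return (f @ f.transpose(1, 2)) / (c * h * w)

def style_consistency_loss(gen_feats, ref_feats):
    """Hypothetical text style consistency loss (illustrative only).

    gen_feats, ref_feats: lists of (B, C, H, W) feature maps for the generated and
    reference text regions, taken from a frozen encoder at several depths.
    """
    return sum(
        F.mse_loss(gram_matrix(g), gram_matrix(r))
        for g, r in zip(gen_feats, ref_feats)
    )
```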

What carries the argument

Dual-branch style encoder that isolates style modeling to produce robust multilingual text style representations from complex real-world backgrounds.
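
The paper describes the encoder only at this level of abstraction. Below is a minimal sketch of one way a dual-branch style encoder could be laid out, assuming one branch sees the full reference crop (global appearance: color, lighting, background texture) and the other sees a foreground-masked crop (glyph-level style: stroke width, font shape), with the two embeddings fused into a single conditioning vector. The branch split, layer sizes, and fusion scheme are all hypothetical.

```python
import torch
import torch.nn as nn

class DualBranchStyleEncoder(nn.Module):
    """Illustrative dual-branch style encoder; not the paper's architecture."""

    def __init__(self, dim: int = 256):
        super().__init__()

        def conv_branch():
            # Small convolutional stack pooled to a single embedding per image.
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.global_branch = conv_branch()  # whole reference crop
        self.glyph_branch = conv_branch()   # crop with background suppressed by a text mask
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, ref_crop: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
        # ref_crop: (B, 3, H, W) reference text region; text_mask: (B, 1, H, W) in [0, 1].
        g = self.global_branch(ref_crop)
        s = self.glyph_branch(ref_crop * text_mask)  # focus on foreground strokes
        return self.fuse(torch.cat([g, s], dim=-1))  # (B, dim) style embedding
```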

Load-bearing premise

The dual-branch style encoder and consistency loss can extract and maintain precise fine-grained text styles from complex backgrounds across languages without needing extra tuning or dataset changes.

What would settle it

Generated images on the StyleText-CE benchmark showing visible mismatches in stroke width, color, or texture between output and reference text in cross-lingual test cases would falsify the performance claim.
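
Settling such a claim requires turning "visible mismatch" into a number. A hedged sketch of one simple instrument follows: mean cosine similarity between embeddings of generated and reference text crops from a frozen image encoder. The paper does not prescribe this metric, and the encoder choice is an assumption; visible stroke, color, or texture mismatches should depress the score relative to a faithful transfer.

```python
import torch
import torch.nn.functional as F

def style_similarity(gen_crops, ref_crops, encoder):
    """Mean cosine similarity between generated and reference text crops.

    gen_crops, ref_crops: (B, 3, H, W) image tensors of cropped text regions.
    encoder: frozen callable mapping (B, 3, H, W) -> (B, D) embeddings (an assumption).
    """
    with torch.no_grad():
        g = F.normalize(encoder(gen_crops), dim=-1)
        r = F.normalize(encoder(ref_crops), dim=-1)
    return (g * r).sum(dim=-1).mean()
```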

Figures

Figures reproduced from arXiv: 2605.14708 by Fangmin Zhao, Liu Yu, Yan Shu, Yichao Liu, Yu Zhou, Zeyu Chen.

Figure 1. Examples of our StyleTextGen for style-conditioned …
Figure 2. Overview of StyleTextGen. (a) Training process. The inpainting input to the diffusion transformer is constructed from a scene …
Figure 3. Qualitative comparison on the StyleText-CE benchmark.
Figure 4. Qualitative results of the ablation study. The left group shows the effects of removing the Text Style Consistency Loss …
Figure 5. Qualitative comparison on the StyleText-CE benchmark.
read the original abstract

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces StyleTextGen, a framework for style-conditioned multilingual scene text generation. It features a dual-branch style encoder for robust style representations from complex backgrounds, a text style consistency loss to improve coherence, and a mask-guided inference strategy for precise alignment. The authors construct the StyleText-CE bilingual benchmark for monolingual and cross-lingual evaluation and claim that the method significantly outperforms prior work in style consistency and cross-lingual generalization, establishing new state-of-the-art results.

Significance. If the empirical results hold, the work could advance scene text generation by addressing style extraction from real-world backgrounds and cross-lingual coherence, areas that remain challenging. The introduction of StyleText-CE as a dedicated benchmark for systematic evaluation is a potentially useful contribution that could support future research in multilingual settings.

major comments (1)
  1. [Abstract] The central claim that StyleTextGen 'significantly outperforms existing methods' and establishes 'new state-of-the-art performance' is presented without any quantitative metrics, error bars, ablation studies, or dataset statistics. The experiments section must supply concrete numbers (e.g., style similarity scores, FID, or user-study results), baseline comparisons, and statistical validation; without them the primary empirical assertion, which is load-bearing for the paper's contribution, remains unverifiable.
minor comments (2)
  1. [Abstract] The dual-branch style encoder and consistency loss are described only at a high level; a brief architectural diagram or pseudocode would improve clarity and reproducibility.
  2. [Abstract] Provide basic statistics for the StyleText-CE benchmark (number of images, text instances, languages covered, and train/test splits) so readers can assess its scope and difficulty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the manuscript to make the central empirical claims more concrete by incorporating key quantitative metrics directly into the abstract while ensuring the experiments section provides full supporting details, including error bars, ablations, and statistical validation.

read point-by-point responses
  1. Referee: [Abstract] The central claim that StyleTextGen 'significantly outperforms existing methods' and establishes 'new state-of-the-art performance' is presented without any quantitative metrics, error bars, ablation studies, or dataset statistics. The experiments section must supply concrete numbers (e.g., style similarity scores, FID, or user-study results), baseline comparisons, and statistical validation; without them the primary empirical assertion, which is load-bearing for the paper's contribution, remains unverifiable.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will add concrete numbers (e.g., style similarity score improvements of X points and FID reductions of Y points relative to the strongest baseline) while preserving the abstract's brevity. The experiments section already contains the requested elements: quantitative style-consistency and FID scores on StyleText-CE, direct comparisons against prior methods, ablation studies isolating the dual-branch encoder and text-style consistency loss, user-study results, and dataset statistics. We will further augment this section with error bars and statistical significance tests to strengthen verifiability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical architecture (dual-branch style encoder, consistency loss, mask-guided inference) evaluated on a newly constructed benchmark (StyleText-CE). No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to its own inputs by construction. The central claims rest on experimental outperformance rather than on self-referential definitions or load-bearing self-citations that would force the result. The framework stands on benchmark comparisons and does not invoke uniqueness theorems or ansatzes from the authors' prior work in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, mathematical axioms, or invented entities; all claims are high-level architectural and empirical.

pith-pipeline@v0.9.0 · 5471 in / 1170 out tokens · 55753 ms · 2026-05-15T04:52:00.033596+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 3 internal anchors

  1. Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  2. Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, and Yu Zhou. The devil is in fine-tuning and long-tailed problems: A new benchmark for scene text detection. arXiv preprint arXiv:2505.15649, 2025.
  3. Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang. POSTA: A go-to framework for customized artistic poster generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28694–28704, 2025.
  4. Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser-2: Unleashing the power of language models for text rendering. arXiv preprint arXiv:2311.16465, 2023.
  5. Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser: Diffusion models as text painters. arXiv preprint, abs/2305.10855, 2023.
  6. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  7. Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(6):4668–4683, 2025.
  8. Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, and Yu-Gang Jiang. Instruction-guided scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2723–2738, 2025.
  9. Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, and Wenjie Pei. Recognition-synergistic scene text editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13104–13113, 2025.
  10. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
  11. Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, et al. A token-level text image foundation model for document understanding. arXiv preprint arXiv:2503.02304.
  12. Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
  13. Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, and Shiyu Chang. Improving diffusion models for scene text editing with dual encoders.
  14. Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  15. Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In European Conference on Computer Vision, pages 150–168. Springer.
  16. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  17. Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  18. Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal Hassner. TextStyleBrush: Transfer of text aesthetics from a single example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):9122–9134, 2023.
  19. Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18062–18071, 2022.
  20. Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
  21. Mingkun Lei, Xue Song, Beier Zhu, Hao Wang, and Chi Zhang. StyleStudio: Text-driven style transfer with selective control of style elements. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23443–23452, 2025.
  22. Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. CoRR, abs/2206.03001, 2022.
  23. Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, abs/2301.12597, 2023.
  24. Zhenhang Li, Yan Shu, Weichao Zeng, Dongbao Yang, and Yu Zhou. First creating backgrounds then rendering texts: A new paradigm for visual text blending. arXiv preprint arXiv:2410.10168, 2024.
  25. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.
  26. Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-ByT5: A customized text encoder for accurate visual text rendering. arXiv preprint arXiv:2403.09622, 2024.
  27. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
  28. Jiahao Lyu, Wei Wang, Dongbao Yang, Jinwen Zhong, and Yu Zhou. Arbitrary reading order scene text spotter with local semantics guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5919–5927, 2025.
  29. Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5955–5963, 2025.
  30. Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, et al. Calligrapher: Freestyle text image customization. arXiv preprint arXiv:2506.24123, 2025.
  31. OpenAI. DALL·E 3. https://openai.com/index/dall-e-3/, 2023.
  32. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
  33. Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, and Yongdong Zhang. Exploring stroke-level modifications for scene text editing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2119–2127, 2023.
  34. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  35. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–.
  36. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  37. Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.
  38. Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, and Umapada Pal. STEFANN: Scene text editor using font adaptive neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13228–13237, 2020.
  39. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  40. Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, and Nicu Sebe. When semantics mislead vision: Mitigating large multimodal models hallucinations in scene text spotting and understanding. arXiv preprint arXiv:2506.05551, 2025.
  41. Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, et al. Visual text processing: A comprehensive review and unified evaluation. arXiv preprint arXiv:2504.21682, 2025.
  42. Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
  43. Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. AnyText: Multilingual visual text generation and editing. arXiv, 2023.
  44. Yuxiang Tuo, Yifeng Geng, and Liefeng Bo. AnyText2: Visual text generation and editing with customizable attributes.
  45. Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. arXiv preprint arXiv:1603.03417, 2016.
  46. Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow, 2024.
  47. Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, and Xiaolin Hu. GlyphMastero: A glyph encoder for high-fidelity scene text editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28523–28532, 2025.
  48. Liang Wu, Chengquan Zhang, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding, and Xiang Bai. Editing text in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1500–1508, 2019.
  49. Yu Xie, Jielei Zhang, Pengyu Chen, Ziyue Wang, Weihang Wang, Longwen Gao, Peiyi Li, Huyang Sun, Qiang Zhang, Qian Qiao, et al. TextFlux: An OCR-free DiT model for high-fidelity multilingual scene text synthesis. arXiv preprint arXiv:2505.17778, 2025.
  50. Qiangpeng Yang, Jun Huang, and Wei Lin. SwapText: Image based texts transfer in scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14700–14709, 2020.
  51. Xiaomeng Yang, Zhi Qiao, and Yu Zhou. IPAD: Iterative, parallel, and diffusion-based network for scene text recognition. International Journal of Computer Vision, 133(8):5589–5609, 2025.
  52. Yukang Yang, Dongnan Gui, Yuhui Yuan, Haisong Ding, Han Hu, and Kai Chen. GlyphControl: Glyph conditional control for visual text generation. arXiv preprint, abs/2305.18259, 2023.
  53. Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, et al. VidText: Towards comprehensive evaluation for video text understanding. arXiv preprint arXiv:2505.22810.
  54. Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint, 2023.
  55. Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, and Dacheng Tao. Hi-SAM: Marrying segment anything model for hierarchical text segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–16, 2024.
  56. Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. TextCtrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2024.
  57. Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10156, 2023.
  58. Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, and Praveen Krishnan. Layout agnostic scene text image synthesis with diffusion models, 2024.
  59. Yiming Zhao and Zhouhui Lian. UDiffText: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models, 2023.
  60. Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, and Yu-Gang Jiang. CDistNet: Perceiving multi-domain character distance for robust text recognition. IJCV, 132(2):300–318, 2024.
  61. Jianqun Zhou, Pengwen Dai, Yang Li, Manjiang Hu, and Xiaochun Cao. Explicitly-decoupled text transfer with the minimized background reconstruction for scene text editing. IEEE Transactions on Image Processing, 2024.