pith. machine review for the scientific record.

arxiv: 2605.14708 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

StyleTextGen: Style-Conditioned Multilingual Scene Text Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text generation · style transfer · multilingual text · image synthesis · style consistency · computer vision

The pith

StyleTextGen generates scene text that matches reference visual styles across languages using a dedicated dual-branch encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StyleTextGen as a way to create text inside images that copies the exact visual appearance of reference text, including for scripts in different languages. Current approaches often fail to pull clean styles out of busy backgrounds or keep the style uniform across every character in a word. The new system adds a dual-branch encoder focused only on style, a loss term that forces style consistency, and a mask-based step at inference time to lock the output style to the input reference. These pieces together produce better results on both single-language and mixed-language cases than earlier methods. The work also releases a bilingual benchmark to measure such performance directly.
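
The abstract does not spell out how the mask-based inference step operates, so the sketch below shows one common way such a strategy can be realized: an inpainting-style denoising loop in which the region outside the target-text mask is held to the re-noised reference latents at every step. The function names, arguments, and the blending rule are illustrative assumptions, not the paper's implementation.

```python
import torch

def mask_guided_denoise(latents, ref_latents, mask, denoise_step, add_noise, timesteps):
    """Hypothetical mask-guided inference loop (a sketch, not StyleTextGen's code).

    latents:      (B, C, H, W) current noisy latents for the image being generated
    ref_latents:  (B, C, H, W) clean latents of the style reference image
    mask:         (B, 1, H, W) 1 inside the region where new text is generated, 0 elsewhere
    denoise_step: callable(latents, t) -> latents one noise level lower (the diffusion model)
    add_noise:    callable(clean_latents, t) -> latents re-noised to level t
    timesteps:    iterable of timesteps from high noise to low noise
    """
    for t in timesteps:
        # Bring the reference to the current noise level so the two regions are compatible.
        noised_ref = add_noise(ref_latents, t)
        # Re-impose the reference context everywhere outside the text mask.
        latents = torch.where(mask.bool(), latents, noised_ref)
        # Model update; the next iteration's blend again locks the unmasked region to the reference.
        latents = denoise_step(latents, t)
    return latents
```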

Core claim

StyleTextGen learns to perceive and replicate visual text styles across different languages and writing systems by introducing a dual-branch style encoder that yields robust multilingual representations from complex scenes, a text style consistency loss that improves coherence and visual quality, and a mask-guided inference strategy that ensures precise alignment, resulting in superior style consistency and cross-lingual generalization over prior methods.
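
The consistency loss is only named in the abstract. As a point of reference, the sketch below shows one plausible form: matching second-order feature statistics (Gram matrices) of the generated and reference text regions, in the spirit of classical style-transfer losses. The use of Gram matrices and of a frozen feature extractor are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    # feats: (B, C, H, W) feature map; returns (B, C, C) normalized channel correlations.
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return (f @ f.transpose(1, 2)) / (c * h * w)

def style_consistency_loss(gen_feats, ref_feats):
    """Hypothetical text style consistency loss (illustrative only).

    gen_feats, ref_feats: lists of (B, C, H, W) feature maps for the generated and
    reference text regions, taken from a frozen encoder at several depths.
    """
    return sum(
        F.mse_loss(gram_matrix(g), gram_matrix(r))
        for g, r in zip(gen_feats, ref_feats)
    )
```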

What carries the argument

Dual-branch style encoder that isolates style modeling to produce robust multilingual text style representations from complex real-world backgrounds.
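
The paper describes the encoder only at this level of abstraction. Below is a minimal sketch of one way a dual-branch style encoder could be laid out, assuming one branch sees the full reference crop (global appearance: color, lighting, background texture) and the other sees a foreground-masked crop (glyph-level style: stroke width, font shape), with the two embeddings fused into a single conditioning vector. The branch split, layer sizes, and fusion scheme are all hypothetical.

```python
import torch
import torch.nn as nn

class DualBranchStyleEncoder(nn.Module):
    """Illustrative dual-branch style encoder; not the paper's architecture."""

    def __init__(self, dim: int = 256):
        super().__init__()

        def conv_branch():
            # Small convolutional stack pooled to a single embedding per image.
            return nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.global_branch = conv_branch()  # whole reference crop
        self.glyph_branch = conv_branch()   # crop with background suppressed by a text mask
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, ref_crop: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
        # ref_crop: (B, 3, H, W) reference text region; text_mask: (B, 1, H, W) in [0, 1].
        g = self.global_branch(ref_crop)
        s = self.glyph_branch(ref_crop * text_mask)  # focus on foreground strokes
        return self.fuse(torch.cat([g, s], dim=-1))  # (B, dim) style embedding
```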

Load-bearing premise

The dual-branch style encoder and consistency loss can extract and maintain precise fine-grained text styles from complex backgrounds across languages without needing extra tuning or dataset changes.

What would settle it

Generated images on the StyleText-CE benchmark showing visible mismatches in stroke width, color, or texture between output and reference text in cross-lingual test cases would falsify the performance claim.
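
Settling such a claim requires turning "visible mismatch" into a number. A hedged sketch of one simple instrument follows: mean cosine similarity between embeddings of generated and reference text crops from a frozen image encoder. The paper does not prescribe this metric, and the encoder choice is an assumption; visible stroke, color, or texture mismatches should depress the score relative to a faithful transfer.

```python
import torch
import torch.nn.functional as F

def style_similarity(gen_crops, ref_crops, encoder):
    """Mean cosine similarity between generated and reference text crops.

    gen_crops, ref_crops: (B, 3, H, W) image tensors of cropped text regions.
    encoder: frozen callable mapping (B, 3, H, W) -> (B, D) embeddings (an assumption).
    """
    with torch.no_grad():
        g = F.normalize(encoder(gen_crops), dim=-1)
        r = F.normalize(encoder(ref_crops), dim=-1)
    return (g * r).sum(dim=-1).mean()
```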

Figures

Figures reproduced from arXiv: 2605.14708 by Fangmin Zhao, Liu Yu, Yan Shu, Yichao Liu, Yu Zhou, Zeyu Chen.

Figure 1. Examples of our StyleTextGen for style-conditioned …
Figure 2. Overview of StyleTextGen. (a) Training process. The inpainting input to the diffusion transformer is constructed from a scene …
Figure 3. Qualitative comparison on the StyleText-CE benchmark.
Figure 4. Qualitative results of the ablation study. The left group shows the effects of removing the Text Style Consistency Loss …
Figure 5. Qualitative comparison on the StyleText-CE benchmark.
read the original abstract

Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces StyleTextGen, a framework for style-conditioned multilingual scene text generation. It features a dual-branch style encoder for robust style representations from complex backgrounds, a text style consistency loss to improve coherence, and a mask-guided inference strategy for precise alignment. The authors construct the StyleText-CE bilingual benchmark for monolingual and cross-lingual evaluation and claim that the method significantly outperforms prior work in style consistency and cross-lingual generalization, establishing new state-of-the-art results.

Significance. If the empirical results hold, the work could advance scene text generation by addressing style extraction from real-world backgrounds and cross-lingual coherence, areas that remain challenging. The introduction of StyleText-CE as a dedicated benchmark for systematic evaluation is a potentially useful contribution that could support future research in multilingual settings.

major comments (1)
  1. [Abstract] The central claim that StyleTextGen 'significantly outperforms existing methods' and establishes 'new state-of-the-art performance' is presented without any quantitative metrics, error bars, ablation studies, or dataset statistics. The experiments section must supply concrete numbers (e.g., style similarity scores, FID, or user-study results), baseline comparisons, and statistical validation; without them the primary empirical assertion, which is load-bearing for the paper's contribution, remains unverifiable.
minor comments (2)
  1. [Abstract] The dual-branch style encoder and consistency loss are described only at a high level; a brief architectural diagram or pseudocode would improve clarity and reproducibility.
  2. [Abstract] Provide basic statistics for the StyleText-CE benchmark (number of images, text instances, languages covered, and train/test splits) so readers can assess its scope and difficulty.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the manuscript to make the central empirical claims more concrete by incorporating key quantitative metrics directly into the abstract while ensuring the experiments section provides full supporting details, including error bars, ablations, and statistical validation.

read point-by-point responses
  1. Referee: [Abstract] The central claim that StyleTextGen 'significantly outperforms existing methods' and establishes 'new state-of-the-art performance' is presented without any quantitative metrics, error bars, ablation studies, or dataset statistics. The experiments section must supply concrete numbers (e.g., style similarity scores, FID, or user-study results), baseline comparisons, and statistical validation; without them the primary empirical assertion, which is load-bearing for the paper's contribution, remains unverifiable.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will add concrete numbers (e.g., style similarity score improvements of X points and FID reductions of Y points relative to the strongest baseline) while preserving the abstract's brevity. The experiments section already contains the requested elements: quantitative style-consistency and FID scores on StyleText-CE, direct comparisons against prior methods, ablation studies isolating the dual-branch encoder and text-style consistency loss, user-study results, and dataset statistics. We will further augment this section with error bars and statistical significance tests to strengthen verifiability. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical architecture (dual-branch style encoder, consistency loss, mask-guided inference) evaluated on a newly constructed benchmark (StyleText-CE). No equations, derivations, or parameter-fitting steps are described that reduce a claimed prediction back to its own inputs by construction. The central claims rest on experimental outperformance rather than on self-referential definitions or load-bearing self-citations that would force the result. The framework stands on benchmark comparisons and does not invoke uniqueness theorems or ansatzes from the authors' prior work in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, mathematical axioms, or invented entities; all claims are high-level architectural and empirical.

pith-pipeline@v0.9.0 · 5471 in / 1170 out tokens · 55753 ms · 2026-05-15T04:52:00.033596+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 3 internal anchors

  1. Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  2. Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, and Yu Zhou. The devil is in fine-tuning and long-tailed problems: A new benchmark for scene text detection. arXiv preprint arXiv:2505.15649, 2025.
  3. Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang. POSTA: A go-to framework for customized artistic poster generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28694–28704, 2025.
  4. Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser-2: Unleashing the power of language models for text rendering. arXiv preprint arXiv:2311.16465, 2023.
  5. Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser: Diffusion models as text painters. arXiv preprint, abs/2305.10855, 2023.
  6. Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  7. Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, and Yu-Gang Jiang. Context perception parallel decoder for scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(6):4668–4683, 2025.
  8. Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, and Yu-Gang Jiang. Instruction-guided scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2723–2738, 2025.
  9. Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, and Wenjie Pei. Recognition-synergistic scene text editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13104–13113, 2025.
  10. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
  11. Tongkun Guan, Zining Wang, Pei Fu, Zhengtao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, et al. A token-level text image foundation model for document understanding. arXiv preprint arXiv:2503.02304.
  12. Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
  13. Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian Price, and Shiyu Chang. Improving diffusion models for scene text editing with dual encoders.
  14. Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  15. Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In European Conference on Computer Vision, pages 150–168. Springer.
  16. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
  17. Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
  18. Praveen Krishnan, Rama Kovvuri, Guan Pang, Boris Vassilev, and Tal Hassner. TextStyleBrush: Transfer of text aesthetics from a single example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):9122–9134, 2023.
  19. Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18062–18071, 2022.
  20. Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
  21. Mingkun Lei, Xue Song, Beier Zhu, Hao Wang, and Chi Zhang. StyleStudio: Text-driven style transfer with selective control of style elements. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23443–23452, 2025.
  22. Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. CoRR, abs/2206.03001, 2022.
  23. Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint, abs/2301.12597, 2023.
  24. Zhenhang Li, Yan Shu, Weichao Zeng, Dongbao Yang, and Yu Zhou. First creating backgrounds then rendering texts: A new paradigm for visual text blending. arXiv preprint arXiv:2410.10168, 2024.
  25. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023.
  26. Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-ByT5: A customized text encoder for accurate visual text rendering. arXiv preprint arXiv:2403.09622, 2024.
  27. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
  28. Jiahao Lyu, Wei Wang, Dongbao Yang, Jinwen Zhong, and Yu Zhou. Arbitrary reading order scene text spotter with local semantics guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5919–5927, 2025.
  29. Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. GlyphDraw2: Automatic generation of complex glyph posters with diffusion models and large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5955–5963, 2025.
  30. Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, et al. Calligrapher: Freestyle text image customization. arXiv preprint arXiv:2506.24123, 2025.
  31. OpenAI. DALL·E 3. https://openai.com/index/dall-e-3/, 2023.
  32. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
  33. Yadong Qu, Qingfeng Tan, Hongtao Xie, Jianjun Xu, Yuxin Wang, and Yongdong Zhang. Exploring stroke-level modifications for scene text editing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2119–2127, 2023.
  34. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  35. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–.
  36. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  37. Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. arXiv preprint arXiv:2410.10792, 2024.
  38. Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, and Umapada Pal. STEFANN: Scene text editor using font adaptive neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13228–13237, 2020.
  39. Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  40. Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, and Nicu Sebe. When semantics mislead vision: Mitigating large multimodal models hallucinations in scene text spotting and understanding. arXiv preprint arXiv:2506.05551, 2025.
  41. Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, et al. Visual text processing: A comprehensive review and unified evaluation. arXiv preprint arXiv:2504.21682, 2025.
  42. Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
  43. Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. AnyText: Multilingual visual text generation and editing. arXiv, 2023.
  44. Yuxiang Tuo, Yifeng Geng, and Liefeng Bo. AnyText2: Visual text generation and editing with customizable attributes.
  45. Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. arXiv preprint arXiv:1603.03417, 2016.
  46. Fu-Yun Wang, Ling Yang, Zhaoyang Huang, Mengdi Wang, and Hongsheng Li. Rectified diffusion: Straightness is not your need in rectified flow, 2024.
  47. Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, and Xiaolin Hu. GlyphMastero: A glyph encoder for high-fidelity scene text editing. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28523–28532, 2025.
  48. Liang Wu, Chengquan Zhang, Jiaming Liu, Junyu Han, Jingtuo Liu, Errui Ding, and Xiang Bai. Editing text in the wild. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1500–1508, 2019.
  49. Yu Xie, Jielei Zhang, Pengyu Chen, Ziyue Wang, Weihang Wang, Longwen Gao, Peiyi Li, Huyang Sun, Qiang Zhang, Qian Qiao, et al. TextFlux: An OCR-free DiT model for high-fidelity multilingual scene text synthesis. arXiv preprint arXiv:2505.17778, 2025.
  50. Qiangpeng Yang, Jun Huang, and Wei Lin. SwapText: Image based texts transfer in scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14700–14709, 2020.
  51. Xiaomeng Yang, Zhi Qiao, and Yu Zhou. IPAD: Iterative, parallel, and diffusion-based network for scene text recognition. International Journal of Computer Vision, 133(8):5589–5609, 2025.
  52. Yukang Yang, Dongnan Gui, Yuhui Yuan, Haisong Ding, Han Hu, and Kai Chen. GlyphControl: Glyph conditional control for visual text generation. arXiv preprint, abs/2305.18259, 2023.
  53. Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, et al. VidText: Towards comprehensive evaluation for video text understanding. arXiv preprint arXiv:2505.22810.
  54. Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint, 2023.
  55. Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, and Dacheng Tao. Hi-SAM: Marrying segment anything model for hierarchical text segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–16, 2024.
  56. Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, and Yu Zhou. TextCtrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems, 37:138569–138594, 2024.
  57. Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10156, 2023.
  58. Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, and Praveen Krishnan. Layout agnostic scene text image synthesis with diffusion models, 2024.
  59. Yiming Zhao and Zhouhui Lian. UDiffText: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models, 2023.
  60. Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, and Yu-Gang Jiang. CDistNet: Perceiving multi-domain character distance for robust text recognition. IJCV, 132(2):300–318, 2024.
  61. Jianqun Zhou, Pengwen Dai, Yang Li, Manjiang Hu, and Xiaochun Cao. Explicitly-decoupled text transfer with the minimized background reconstruction for scene text editing. IEEE Transactions on Image Processing, 2024.