TextWand: A Unified Framework for Scene Text Editing

Hongxiu Chen; Jian Zhang; Ronggang Wang; Shuyu Wang; Weiqi Li; Xin Shan; Yule Duan; Zhile Guan

arxiv: 2606.05730 · v1 · pith:R4AOFJSKnew · submitted 2026-06-04 · 💻 cs.CV

Pith reviewed 2026-06-28 01:53 UTC · model grok-4.3

keywords: scene text editing, text removal, text generation, text replacement, unified framework, positional encoding, erasure suppression, benchmark dataset

The pith

TextWand unifies scene text removal, generation and replacement in one model by splitting edits into rendering and erasure primitives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TextWand as a single framework that handles removal, generation and replacement of text in natural scene images. It works by reducing these tasks to two basic operations: rendering new text and erasing existing text. Two new components support this decomposition: Overlay-Reference Positional Encoding maintains exact layout and copies style from reference examples, while Region-Adaptive Suppression produces clean erasures without leftover artifacts. Because prior datasets cover only one task at a time, the authors also release TextWand-Bench. Experiments on that benchmark show higher text accuracy, layout consistency and final image quality than both open-source and closed-source alternatives.
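As a rough illustration of the decomposition described above, the three tasks can be written as combinations of two primitives. The sketch below is a toy under stated assumptions, not the paper's implementation: the names `Primitive`, `TASK_PRIMITIVES`, and `decompose` are hypothetical, and the erase-then-render ordering for replacement is an assumption, since the summary does not say how the single model sequences the two operations.

```python
from enum import Enum


class Primitive(Enum):
    """The two atomic operations named in the summary."""
    ERASE = "erase"
    RENDER = "render"


# Hypothetical mapping from task name to the primitives it needs.
# The ordering for "replacement" is an assumption made for illustration.
TASK_PRIMITIVES = {
    "removal": [Primitive.ERASE],
    "generation": [Primitive.RENDER],
    "replacement": [Primitive.ERASE, Primitive.RENDER],
}


def decompose(task: str) -> list[Primitive]:
    """Return the ordered primitives for a task name (illustrative only)."""
    try:
        return list(TASK_PRIMITIVES[task])
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
```

Under this reading, removal and generation each use one primitive and replacement uses both, which is why a single model could in principle cover all three tasks.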

Core claim

By decomposing complex scene-text edits into the atomic primitives of rendering and erasure, TextWand achieves precise control over text appearance and background integrity. Overlay-Reference Positional Encoding enforces pixel-level layout fidelity and exemplar-driven style control, while Region-Adaptive Suppression ensures clean text erasure. The resulting model outperforms existing leading open-source and closed-source models on text content accuracy, layout and style consistency, and overall image quality across all three editing tasks.

What carries the argument

Overlay-Reference Positional Encoding (ORPE) for pixel-level layout fidelity and exemplar-driven style control, paired with Region-Adaptive Suppression (RAS) for clean text erasure.

If this is right

One model replaces separate networks for removal, generation and replacement.
Pixel-level layout and style are preserved without extra alignment steps.
Background regions remain intact after text is erased or overwritten.
TextWand-Bench supplies the first unified test set for general-purpose scene text editing.
Gains in accuracy and consistency hold against both open-source and closed-source baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rendering-erasure split could be tested on other structured image elements such as logos or signs.
A single trained model reduces memory and inference cost compared with maintaining three separate editors.
If the decomposition generalizes, the framework might extend to short video clips by applying the same primitives frame by frame.
Downstream applications such as automatic sign translation or document redaction could adopt the unified model directly.

Load-bearing premise

Complex scene-text edits can be reliably decomposed into rendering and erasure without loss of fidelity or introduction of artifacts that would require task-specific post-processing.

What would settle it

A set of scene-text edits where the rendering-plus-erasure decomposition produces visible artifacts or lower accuracy than a task-specific model, even after applying ORPE and RAS.

Original abstract

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TextWand unifies removal, generation, and replacement via a rendering/erasure split plus ORPE and RAS, but the claimed performance edge rests on a new benchmark, and the quantitative details are missing from the summary.

TextWand puts removal, generation, and replacement into one model by treating edits as combinations of rendering and erasure. The concrete additions are Overlay-Reference Positional Encoding to lock down layout and pull style from an exemplar, plus Region-Adaptive Suppression to clean up erasure without stray marks. They also built TextWand-Bench because prior datasets only covered one task at a time.

The work is useful where it shows a single set of weights can handle the three jobs without obvious task-specific hacks. ORPE and RAS look like targeted fixes for layout drift and incomplete erasure, which are common failure modes in this area.

The soft spots sit in the evaluation. A fresh benchmark always carries the risk that test cases were chosen or filtered in ways that favor the new method; the paper needs to spell out the collection process and show that the splits are not post-hoc. The abstract asserts better text accuracy, layout consistency, style match, and image quality than both open and closed models, yet the lack of reported numbers, error bars, or ablation tables in the summary makes it hard to gauge how large or stable those gains are. The rendering/erasure decomposition is a reasonable engineering bet, but any cases where it forces extra artifacts would need explicit discussion.

The paper is aimed at CV groups that build editing tools for photos or video with overlaid text. A reader who needs a practical starting point for multi-task text work would find the framework and benchmark worth looking at. It deserves a serious referee because the core idea is coherent and the new components are falsifiable.

Send it to review, with the main requests being full metric tables, benchmark construction details, and controls for selection bias.
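The request for error bars could be met in several ways; one common option is a paired bootstrap over per-image scores. The sketch below is a generic statistical illustration, not the paper's evaluation protocol: the function name `paired_bootstrap_ci` is hypothetical, and the score lists are placeholder values invented for the example, not results from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-sample differences a[i] - b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi


# Placeholder per-image scores, invented for illustration; NOT from the paper.
unified = [0.91, 0.88, 0.95, 0.79, 0.90, 0.93]
baseline = [0.85, 0.86, 0.90, 0.80, 0.84, 0.89]
point, lo, hi = paired_bootstrap_ci(unified, baseline)
```

If the resulting interval excludes zero, the gain is unlikely to be resampling noise on that test set; it says nothing about whether the test set itself was selected in the method's favor, which is the separate concern raised in the letter.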

Referee Report

2 major / 0 minor

Summary. The manuscript proposes TextWand, a unified framework for scene text removal, generation, and replacement that decomposes complex edits into the atomic primitives of rendering and erasure. It introduces Overlay-Reference Positional Encoding (ORPE) to enforce pixel-level layout fidelity and exemplar-driven style control, Region-Adaptive Suppression (RAS) for clean text erasure, and constructs the TextWand-Bench benchmark to address the lack of comprehensive multi-task datasets. The central claim is that TextWand outperforms leading open-source and closed-source models in text content accuracy, layout/style consistency, and overall image quality across the three tasks.

Significance. If the quantitative results and ablations hold, the work would supply a single model and benchmark for general-purpose scene text editing, filling a documented gap between single-task datasets and models. The ORPE and RAS components are presented as novel engineering contributions whose effectiveness would need to be demonstrated through controlled comparisons.

major comments (2)

[Abstract] Abstract: the claim that TextWand 'outperforms existing leading open-source and closed-source models' in text content accuracy, layout and style consistency, and image quality is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, dataset statistics, or baseline names to support it.
[Abstract / Method (implied)] The decomposition of scene-text edits into rendering and erasure primitives is asserted to achieve 'precise control over both text appearance and background integrity' without loss of fidelity, but no section, equation, or experiment is cited that tests whether this reduction introduces artifacts requiring task-specific post-processing, which is the weakest assumption identified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing where the manuscript can be strengthened and providing clarifications based on the existing experiments and sections.

Referee: [Abstract] Abstract: the claim that TextWand 'outperforms existing leading open-source and closed-source models' in text content accuracy, layout and style consistency, and image quality is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, dataset statistics, or baseline names to support it.

Authors: We agree that the abstract is concise and does not embed specific numbers or baseline names. The detailed quantitative results (including metrics on text accuracy, layout consistency, and image quality), ablation studies, error bars where applicable, dataset statistics for TextWand-Bench, and comparisons against named open-source and closed-source baselines are reported in Section 4 and Tables 1–3. To address the concern, we will revise the abstract to incorporate a small number of key quantitative highlights (e.g., average accuracy gains) while preserving its summary nature.

Revision: yes

Referee: [Abstract / Method (implied)] The decomposition of scene-text edits into rendering and erasure primitives is asserted to achieve 'precise control over both text appearance and background integrity' without loss of fidelity, but no section, equation, or experiment is cited that tests whether this reduction introduces artifacts requiring task-specific post-processing, which is the weakest assumption identified.

Authors: The decomposition into rendering and erasure is the architectural foundation described in Section 3, with ORPE and RAS explicitly designed to maintain fidelity; the unified model is evaluated end-to-end on all three tasks in Section 4 without any task-specific post-processing, and both quantitative metrics and qualitative results demonstrate clean outputs. We acknowledge that an explicit, dedicated experiment isolating potential decomposition-induced artifacts is not separately highlighted. We will therefore add a short paragraph in Section 3.2 citing the relevant experimental evidence that no such artifacts appear and no post-processing is required.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description introduce TextWand as a framework that decomposes tasks into rendering and erasure primitives, defines new components ORPE and RAS, and constructs a new benchmark TextWand-Bench. No equations, fitted parameters, self-citations, or derivations are described that reduce any claimed result to an input by construction. The central claims rest on experimental comparisons rather than self-referential definitions or renamed known results, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review yields no explicit free parameters, mathematical axioms, or invented physical entities; the two named techniques are engineering designs rather than new postulated objects.

invented entities (2)

Overlay-Reference Positional Encoding (ORPE) · no independent evidence
purpose: Enforce pixel-level layout fidelity and exemplar-driven style control
Novel design introduced to support the unified editing pipeline.

Region-Adaptive Suppression (RAS) · no independent evidence
purpose: Ensure clean text erasure without background damage
New strategy presented for the erasure primitive.


[1] [1]

arXiv preprint arXiv:2504.21682 (2025)

Shu, Y., Zeng, W., Zhao, F., Chen, Z., Li, Z., Yang, X., Zhou, Y., Rota, P., Bai, X., Jin, L., et al.: Visual text processing: A compre- hensive review and unified evaluation. arXiv preprint arXiv:2504.21682 (2025)

work page arXiv 2025

[2] Yang, F., Su, T., Zhou, X., Di, D., Wang, Z., Li, S.: Self-supervised cross-language scene text editing. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4546–4554 (2023)

[3] Lee, H., Choi, C.: The surprisingly straightforward scene text removal method with gated attention and region of interest generation: A comprehensive prominent model analysis. In: European Conference on Computer Vision, pp. 457–472 (2022). Springer

[4] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR

[5] Bai, Y., Huang, Z., Gao, W., Yang, S., Liu, J., et al.: Intelligent artistic typography: A comprehensive review of artistic text design and generation. APSIPA Transactions on Signal and Information Processing 13(1) (2024)

[6] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.-m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-Image technical report. arXiv preprint arXiv:2508.02324 (2025)

[7] Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space (2025)

[8] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)

[9] Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., Dong, Y., Gong, K., Gu, T., Gu, X., et al.: HunyuanImage 3.0 technical report. arXiv preprint arXiv:2509.23951 (2025)

[10] Shi, W., Song, Y., Zhang, D., Liu, J., Zou, X.: Fonts: Text rendering with typography and style controls. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18463–18474 (2025)

[11] Lan, R., Bai, Y., Duan, X., Li, M., Jin, D., Xu, R., Nie, D., Sun, L., Chu, X.: Flux-text: A simple and advanced diffusion transformer baseline for scene text editing. arXiv preprint arXiv:2505.03329 (2025)

[12] Tuo, Y., Geng, Y., Bo, L.: Anytext2: Visual text generation and editing with customizable attributes. arXiv preprint arXiv:2411.15245 (2024)

[13] Wang, T., Liu, T., Qu, X., Wu, C., Liu, L., Hu, X.: Glyphmastero: A glyph encoder for high-fidelity scene text editing. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 28523–28532 (2025)

[14] Tuo, Y., Xiang, W., He, J.-Y., Geng, Y., Xie, X.: Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054 (2023)

[15] Zhao, Y., Gao, Y., Luo, Y., Duan, J., Lin, S., Xiong, L., Lian, Z.: Utdesign: A unified framework for stylized text editing and generation in graphic design images. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pp. 1–11 (2025)

[16] Zeng, W., Shu, Y., Li, Z., Yang, D., Zhou, Y.: Textctrl: Diffusion-based scene text editing with prior guidance control. Advances in Neural Information Processing Systems 37, 138569–138594 (2024)

[17] Nakamura, T., Zhu, A., Yanai, K., Uchida, S.: Scene text eraser. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 832–837 (2017). IEEE

[18] Tursun, O., Denman, S., Zeng, R., Sivapalan, S., Sridharan, S., Fookes, C.: Mtrnet++: One-stage mask-based scene text eraser. Computer Vision and Image Understanding 201, 103066 (2020)

[19] Tang, Z., Miyazaki, T., Sugaya, Y., Omachi, S.: Stroke-based scene text erasing using synthetic data for training. IEEE Transactions on Image Processing 30, 9306–9320 (2021)

[20] Wang, Y., Xie, H., Wang, Z., Qu, Y., Zhang, Y.: What is the real need for scene text removal? exploring the background integrity and erasure exhaustivity properties. IEEE Transactions on Image Processing 32, 4567–4580 (2023)

[21] Lu, R., Zhang, Y., Liu, J., Wang, H., Song, Y.: Easytext: Controllable diffusion transformer for multilingual text rendering. arXiv preprint arXiv:2505.24417 (2025)

[22] Gunawan, A., Teodoro, S., Chen, Y., Kim, S.Y., Oh, J., Kim, M.: Omnitext: A training-free generalist for controllable text-image manipulation. arXiv preprint arXiv:2510.24093 (2025)

[23] Liu, Z., Liang, W., Liang, Z., Luo, C., Li, J., Huang, G., Yuan, Y.: Glyph-byt5: A customized text encoder for accurate visual text rendering. In: European Conference on Computer Vision, pp. 361–377 (2024). Springer

[24] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems 36, 9353–9387 (2023)

[25] Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., Wei, F.: Textdiffuser-2: Unleashing the power of language models for text rendering. In: European Conference on Computer Vision, pp. 386–402 (2024). Springer

[26] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-Resolution Image Synthesis with Latent Diffusion Models (2021)

[27] Labs, B.F.: FLUX (2024)

[28] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)

[29] Team, Z.-I.: Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699 (2025)

[30] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025)

[31] Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., Li, G., Peng, Y., Sun, Q., Wu, J., Cai, Y., Ge, Z., Ming, R., Xia, L., Zeng, X., Zhu, Y., Jiao, B., Zhang, X., Yu, G., Jiang, D.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

[32] Wang, J., Chen, Y., Yu, J., Lu, G., Pei, W.: Editinfinity: Image editing with binary-quantized generative models. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

[33] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)

[34] Wang, S., Li, W., Wang, Q., Zhao, S., Zhang, J.: Mind-edit: Mllm insight-driven editing via language-vision projection. arXiv preprint arXiv:2505.19149 (2025)

[35] Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

[36] Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 4296–4304 (2024)

[37] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision, pp. 150–168 (2024). Springer

[38] Mou, C., Wang, X., Song, J., Shan, Y., Zhang, J.: Diffeditor: Boosting accuracy and flexibility on diffusion-based image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8488–8497 (2024)

[39] Liu, Z., Yu, Y., Ouyang, H., Wang, Q., Cheng, K.L., Wang, W., Liu, Z., Chen, Q., Shen, Y.: Magicquill: An intelligent interactive image editing system. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

[40] Shi, Y., Xue, C., Liew, J.H., Pan, J., Yan, H., Zhang, W., Tan, V.Y., Bai, S.: Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8839–8849 (2024)

[41] Park, J., Gim, J., Lee, K., Lee, S., Im, S.: Style-editor: Text-driven object-centric style editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18281–18291 (2025)

[42] Dai, M., Zhou, Q., Yi, R., Ma, L.: Diffusefist: A fast image-guided style transfer method for adapting large-scale diffusion models. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE

[43] Li, Z., Liu, Z., Zhang, Q., Lin, B., Wu, F., Yuan, S., Yan, Z., Ye, Y., Yu, W., Niu, Y., et al.: UniWorld-V2: Reinforce image editing with diffusion negative-aware finetuning and MLLM implicit feedback. arXiv preprint arXiv:2510.16888 (2025)

[44] Yu, Q., Chow, W., Yue, Z., Pan, K., Wu, Y., Wan, X., Li, J., Tang, S., Zhang, H., Zhuang, Y.: Anyedit: Mastering unified high-quality image editing for any idea. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 26125–26135 (2025)

[45] Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: UniWorld-V1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

[46] Kulikov, V., Kleiner, M., Huberman-Spiegelglas, I., Michaeli, T.: Flowedit: Inversion-free text-based editing using pre-trained flow models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19721–19730 (2025)

[47] Liu, Y., Lian, Z.: Qt-font: High-efficiency font synthesis via quadtree-based diffusion models. In: ACM SIGGRAPH 2024 Conference Papers, pp. 1–11 (2024)

[48] Su, T., Yang, F., Zhou, X., Di, D., Wang, Z., Li, S.: Scene style text editing. arXiv preprint arXiv:2304.10097 (2023)

[49] Ji, J., Zhang, G., Wang, Z., Hou, B., Zhang, Z., Price, B., Chang, S.: Improving diffusion models for scene text editing with dual encoders. arXiv preprint arXiv:2304.05568 (2023)

[50] Yang, Z., Peng, D., Kong, Y., Zhang, Y., Yao, C., Jin, L.: Fontdiffuser: One-shot font generation via denoising diffusion with multi-scale content aggregation and style contrastive learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 6603–6611 (2024)

[51] Nikolaidou, K., Retsinas, G., Sfikas, G., Liwicki, M.: Diffusionpen: towards controlling the style of handwritten text generation. In: European Conference on Computer Vision, pp. 417–434 (2024). Springer

[52] Nakamura, T.N., Zhu, A., Uchida, S.: Scene text magnifier. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 825–830 (2019)

[53] Fang, Z., Lyu, P., Wu, J., Zhang, C., Yu, J., Lu, G., Pei, W.: Recognition-synergistic scene text editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 13104–13113 (2025)

[54] Zhao, Y., Lian, Z.: Udifftext: A unified framework for high-quality text synthesis in arbitrary images via character-aware diffusion models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) European Conference on Computer Vision. Springer (2024)

[55] Jiang, B., Yuan, Y., Bai, X., Hao, Z., Yin, A., Hu, Y., Liao, W., Ungar, L., Taylor, C.J.: Controltext: Unlocking controllable fonts in multilingual text rendering without font annotations. arXiv preprint arXiv:2502.10999 (2025)

[56] Kingma, D.P., Welling, M., et al.: Auto-encoding variational Bayes. In: International Conference on Learning Representations. Banff, Canada (2014)

[57] Chen, S., Lai, J., Gao, J., Ye, T., Chen, H., Shi, H., Shao, S., Lin, Y., Fei, S., Xing, Z., Jin, Y., Luo, J., Wei, X., Zhu, L.: Postercraft: Rethinking high-quality aesthetic poster generation in a unified framework. arXiv preprint arXiv:2506.10741 (2025)

[58] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022)

[59] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)

[60] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

[61] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv preprint arXiv:2511.21631 (2025)

[62] Team, M.L., Ma, H., Tan, H., Huang, J., Wu, J., He, J.-Y., Gao, L., Xiao, S., Wei, X., Ma, X., Cai, X., Guan, Y., Hu, J.: LongCat-Image technical report. arXiv preprint arXiv:2512.07584 (2025)