
arxiv: 2606.05730 · v1 · pith:R4AOFJSKnew · submitted 2026-06-04 · 💻 cs.CV

TextWand: A Unified Framework for Scene Text Editing

Pith reviewed 2026-06-28 01:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text editing · text removal · text generation · text replacement · unified framework · positional encoding · erasure suppression · benchmark dataset

The pith

TextWand unifies scene text removal, generation and replacement in one model by splitting edits into rendering and erasure primitives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TextWand as a single framework that handles removal, generation and replacement of text in natural scene images. It works by reducing these tasks to two basic operations: rendering new text and erasing existing text. Two new components support this decomposition: Overlay-Reference Positional Encoding maintains exact layout and copies style from reference examples, while Region-Adaptive Suppression produces clean erasures without leftover artifacts. Because prior datasets cover only one task at a time, the authors also release TextWand-Bench. Experiments on that benchmark show higher text accuracy, layout consistency and final image quality than both open-source and closed-source alternatives.

Core claim

By decomposing complex scene-text edits into the atomic primitives of rendering and erasure, TextWand achieves precise control over text appearance and background integrity. Overlay-Reference Positional Encoding enforces pixel-level layout fidelity and exemplar-driven style control, while Region-Adaptive Suppression ensures clean text erasure. The resulting model outperforms existing leading open-source and closed-source models on text content accuracy, layout and style consistency, and overall image quality across all three editing tasks.
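
The decomposition idea can be illustrated with a toy sketch. The mapping below (removal as erase only, generation as render only, replacement as erase followed by render) is an assumption inferred from the summary, not something the paper's abstract spells out, and the names `Primitive`, `TASK_PLAN`, `decompose`, and `apply_edit` are hypothetical. The "image" is a list of strings, so nothing here reflects how ORPE or RAS actually work.

```python
from enum import Enum


class Primitive(Enum):
    ERASE = "erase"    # remove existing text and restore the background
    RENDER = "render"  # draw new text at a given position


# Assumed task-to-primitive mapping (illustrative, not taken from the paper).
TASK_PLAN = {
    "removal": [Primitive.ERASE],
    "generation": [Primitive.RENDER],
    "replacement": [Primitive.ERASE, Primitive.RENDER],
}


def decompose(task: str) -> list[Primitive]:
    """Return the primitive sequence assumed for an editing task."""
    try:
        return list(TASK_PLAN[task])
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


def apply_edit(canvas: list[str], task: str, row: int, col: int,
               width: int, new_text: str = "", background: str = ".") -> list[str]:
    """Apply a task to a toy character canvas, one primitive at a time."""
    out = list(canvas)
    for step in decompose(task):
        line = out[row]
        if step is Primitive.ERASE:
            # Overwrite the whole region with the background character.
            patch = background * width
        else:
            # Write the new text, clipped to the region width.
            patch = new_text[:width]
        out[row] = line[:col] + patch + line[col + len(patch):]
    return out
```

In this toy setting, `apply_edit(["..OPEN.."], "replacement", 0, 2, 4, "SHUT")` runs the erase step and then the render step on the same region, which is the sense in which one pipeline could cover all three tasks.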

What carries the argument

Overlay-Reference Positional Encoding (ORPE) for pixel-level layout fidelity and exemplar-driven style control, paired with Region-Adaptive Suppression (RAS) for clean text erasure.

If this is right

  • One model replaces separate networks for removal, generation and replacement.
  • Pixel-level layout and style are preserved without extra alignment steps.
  • Background regions remain intact after text is erased or overwritten.
  • TextWand-Bench supplies the first unified test set for general-purpose scene text editing.
  • The accuracy and consistency gains hold against both open-source and closed-source baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rendering-erasure split could be tested on other structured image elements such as logos or signs.
  • A single trained model reduces memory and inference cost compared with maintaining three separate editors.
  • If the decomposition generalizes, the framework might extend to short video clips by applying the same primitives frame by frame.
  • Downstream applications such as automatic sign translation or document redaction could adopt the unified model directly.

Load-bearing premise

Complex scene-text edits can be reliably decomposed into rendering and erasure without loss of fidelity or introduction of artifacts that would require task-specific post-processing.

What would settle it

A set of scene-text edits where the rendering-plus-erasure decomposition produces visible artifacts or lower accuracy than a task-specific model, even after applying ORPE and RAS.
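
A minimal sketch of how such a check could be run, assuming per-sample scores where higher is better (for example, a text accuracy score per edited image). The function name `find_counterexamples`, the `margin` parameter, and the score format are all hypothetical; the page does not say which metrics TextWand-Bench uses.

```python
def find_counterexamples(unified, specialist, margin=0.0):
    """Return indices where a task-specific model beats the unified model.

    `unified` and `specialist` are equal-length sequences of per-sample
    scores (higher is better). A sample counts as a counterexample when
    the specialist's score exceeds the unified model's by more than `margin`.
    """
    if len(unified) != len(specialist):
        raise ValueError("score lists must have the same length")
    return [
        i for i, (u, s) in enumerate(zip(unified, specialist))
        if s - u > margin
    ]
```

A non-empty result on a representative set of edits would count against the load-bearing premise; an empty result would be consistent with it, though only for the samples and the metric tested.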

Original abstract

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes TextWand, a unified framework for scene text removal, generation, and replacement that decomposes complex edits into the atomic primitives of rendering and erasure. It introduces Overlay-Reference Positional Encoding (ORPE) to enforce pixel-level layout fidelity and exemplar-driven style control, Region-Adaptive Suppression (RAS) for clean text erasure, and constructs the TextWand-Bench benchmark to address the lack of comprehensive multi-task datasets. The central claim is that TextWand outperforms leading open-source and closed-source models in text content accuracy, layout/style consistency, and overall image quality across the three tasks.

Significance. If the quantitative results and ablations hold, the work would supply a single model and benchmark for general-purpose scene text editing, filling a documented gap between single-task datasets and models. The ORPE and RAS components are presented as novel engineering contributions whose effectiveness would need to be demonstrated through controlled comparisons.

major comments (2)
  1. [Abstract] The claim that TextWand 'outperforms existing leading open-source and closed-source models' in text content accuracy, layout and style consistency, and image quality is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, dataset statistics, or baseline names to support it.
  2. [Abstract / Method (implied)] The decomposition of scene-text edits into rendering and erasure primitives is asserted to achieve 'precise control over both text appearance and background integrity' without loss of fidelity, but no section, equation, or experiment is cited that tests whether this reduction introduces artifacts requiring task-specific post-processing, which is the weakest assumption identified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing where the manuscript can be strengthened and providing clarifications based on the existing experiments and sections.

Point-by-point responses
  1. Referee: [Abstract] The claim that TextWand 'outperforms existing leading open-source and closed-source models' in text content accuracy, layout and style consistency, and image quality is load-bearing for the paper's contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, dataset statistics, or baseline names to support it.

    Authors: We agree that the abstract is concise and does not embed specific numbers or baseline names. The detailed quantitative results (including metrics on text accuracy, layout consistency, and image quality), ablation studies, error bars where applicable, dataset statistics for TextWand-Bench, and comparisons against named open-source and closed-source baselines are reported in Section 4 and Tables 1–3. To address the concern, we will revise the abstract to incorporate a small number of key quantitative highlights (e.g., average accuracy gains) while preserving its summary nature.
    revision: yes

  2. Referee: [Abstract / Method (implied)] The decomposition of scene-text edits into rendering and erasure primitives is asserted to achieve 'precise control over both text appearance and background integrity' without loss of fidelity, but no section, equation, or experiment is cited that tests whether this reduction introduces artifacts requiring task-specific post-processing, which is the weakest assumption identified.

    Authors: The decomposition into rendering and erasure is the architectural foundation described in Section 3, with ORPE and RAS explicitly designed to maintain fidelity; the unified model is evaluated end-to-end on all three tasks in Section 4 without any task-specific post-processing, and both quantitative metrics and qualitative results demonstrate clean outputs. We acknowledge that an explicit, dedicated experiment isolating potential decomposition-induced artifacts is not separately highlighted. We will therefore add a short paragraph in Section 3.2 citing the relevant experimental evidence that no such artifacts appear and no post-processing is required.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The provided abstract and description introduce TextWand as a framework that decomposes tasks into rendering and erasure primitives, defines new components ORPE and RAS, and constructs a new benchmark TextWand-Bench. No equations, fitted parameters, self-citations, or derivations are described that reduce any claimed result to an input by construction. The central claims rest on experimental comparisons rather than self-referential definitions or renamed known results, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review yields no explicit free parameters, mathematical axioms, or invented physical entities; the two named techniques are engineering designs rather than new postulated objects.

invented entities (2)
  • Overlay-Reference Positional Encoding (ORPE) · no independent evidence
    purpose: Enforce pixel-level layout fidelity and exemplar-driven style control
    Novel design introduced to support the unified editing pipeline.
  • Region-Adaptive Suppression (RAS) · no independent evidence
    purpose: Ensure clean text erasure without background damage
    New strategy presented for the erasure primitive.

pith-pipeline@v0.9.1-grok · 5693 in / 1344 out tokens · 29388 ms · 2026-06-28T01:53:57.838739+00:00 · methodology

