
arxiv: 2606.24484 · v1 · pith:RIQSKTNEnew · submitted 2026-06-23 · 💻 cs.CV

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

Pith reviewed 2026-06-26 01:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords: scene text recognition, WordArt, synthetic dataset, autoregressive decoder, artistic text, arbitrary shape, vision encoder

The pith

A 2-million-image synthetic dataset paired with an arbitrary-shape encoder and autoregressive decoder reaches 90.40 percent accuracy on the WordArt-Bench artistic-text benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a 2M-image synthetic dataset WATER-S that combines an upgraded rendering pipeline with AI-driven image synthesis to supply diverse artistic text examples at a scale hundreds of times larger than prior collections. It introduces the WATERec model, whose visual encoder accepts arbitrary shapes and whose autoregressive decoder handles complex layouts that defeat fixed-template recognizers. Experiments show this combination delivers 90.40 percent accuracy on WordArt-Bench and exceeds both general vision-language models and specialized OCR systems. Readers would care because standard scene-text tools routinely fail on the customized fonts, textures, and arrangements found in design, advertising, and packaging.

Core claim

WATER-S is generated as two complementary subsets, one via an upgraded SynthWordArt renderer and one via Qwen3-VL prompt mining plus Z-Image synthesis. Trained on this data together with the reorganized real set WATER-R, WATERec, which pairs an arbitrary-shape visual encoder with an autoregressive decoder, attains 90.40 percent accuracy on WordArt-Bench and surpasses prior STR methods as well as both general-purpose and OCR-specialized vision-language models.

What carries the argument

WATERec, a model that pairs a visual encoder supporting arbitrary-shaped inputs with an autoregressive decoder to model complex layouts.
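
As a rough illustration of why arbitrary-shaped inputs yield token sequences of different lengths, the sketch below counts the non-overlapping patches for a few image sizes and lists each patch's 2D grid coordinate. The patch size of 16, the function names, and the round-up handling of sides that are not multiples of the patch size are all assumptions made for illustration; none of them is taken from the paper.

```python
import math
from typing import List, Tuple

PATCH = 16  # assumed patch size; the paper's actual value is not stated here


def patch_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches, rounding up as if padded."""
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    return math.ceil(height / patch), math.ceil(width / patch)


def patch_coordinates(height: int, width: int, patch: int = PATCH) -> List[Tuple[int, int]]:
    """List the (x, y) grid coordinate of every patch in row-major order."""
    rows, cols = patch_grid(height, width, patch)
    return [(x, y) for y in range(rows) for x in range(cols)]


if __name__ == "__main__":
    # Three hypothetical crops keep their own aspect ratios, so each one
    # produces its own grid shape and token count instead of a fixed template.
    for h, w in [(32, 128), (48, 160), (100, 40)]:
        coords = patch_coordinates(h, w)
        print(f"{h}x{w}: grid={patch_grid(h, w)}, tokens={len(coords)}")
```

A fixed-template recognizer would instead resize every crop to one size, so all three inputs would map to the same grid regardless of their original aspect ratio.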

If this is right

  • The architecture removes the fixed-template bottleneck that limits conventional scene-text recognizers on irregular artistic text.
  • Performance on WordArt-Bench exceeds both general vision-language models and OCR-specialized models by a large margin.
  • The two-part synthetic construction supplies controllable yet diverse training data at a scale previously unavailable for this task.
  • Reorganization of existing real STR data into WATER-R further strengthens the baseline when combined with the new synthetic set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis-plus-autoregressive approach could be tested on related stylized-text domains such as comic lettering or product logos.
  • Real deployment would likely require targeted fine-tuning on domain-specific artistic styles not fully represented in the current generation pipeline.
  • The method suggests a template for scaling recognition in other data-scarce visual domains by pairing large synthetic sets with layout-aware decoders.

Load-bearing premise

The synthetic images produced by the upgraded rendering pipeline and the AI synthesis tools capture the visual diversity and real-world challenges of artistic text without introducing biases that block generalization.

What would settle it

Measuring accuracy on a fresh collection of real-world photographs of artistic text that were never used in the synthetic generation or the WordArt-Bench construction would directly test whether performance holds outside the created data.

Figures

Figures reproduced from arXiv: 2606.24484 by Chen Li, Chong Sun, Haojie Zhang, Jiaxin Zhang, Jing Lyu, Xingsong Ye, Yongkun Du, Zhineng Chen.

Figure 1: Top: Two subsets’ examples of our synthetic datasets (WATER-S) and the real artistic text benchmark (WordArt-Bench). Bottom: Recognition accuracy on WordArt-Bench for various STR methods and VLMs. For each STR entry, the term before “/” denotes the model and the term after “/” denotes the training data used. “R” indicates existing real datasets only, while “RS” indicates adding our synthetic data.
Figure 2: Pipelines of our two synthetic datasets. The top illustrates the WATER-T synthesis pipeline: the art-text-oriented tool SynthWordArt renders provided artistic fonts, background images, and real text corpora into artistic text images. The bottom shows the WATER-Z synthesis pipeline: A small set of real artistic text images are fed into Qwen3-VL (8B) [1] to obtain detailed captions. Then Qwen3-VL is leverage…
Figure 3: The overall architecture of WATERec. It consists of a Vision Transformer Encoder that uses RoPE Attention (illustrated on the right) to support inputs of arbitrary shape, and an AR Transformer Decoder. It also shows images of different sizes, which are processed into tokens of different lengths. <B> denotes the beginning token of decoding, and <E> denotes the end token of decoding.
Figure 4: Comparison of feature maps from different encoder outputs. Top: Images fed into the WATERec while preserving aspect ratio, and the corresponding encoder feature maps. Bottom: Baseline with fixed-template inputs (model in the first row of Tab. 4). 5.4 More Visualization and Analysis. Arbitrary-shape modeling directs more robust attention. To intuitively verify the effectiveness of our arbitrary-shaped input …
Figure 5: Left: Visualization of different models’ predictions on the WordArt-Bench, a supplement to …
Figure 1: Word cloud of artistic font tags used in WATER-T. A More Data Details. A.1 Resources Used and Generated in WATER-S. Artistic Fonts: We first collect artistic font resources from open-source font platforms, design asset websites, and public code repositories. To ensure both stylistic diversity and licensing compliance, we only retain fonts that satisfy the following conditions: (1) The font description or tags…
Figure 2: Distribution of representative character proportions across different corpora, with all letters converted to lowercase. Caption Extraction Prompt: “I am providing an image of an artistic text area that has been cropped from a real photograph. “ f“The original text within the image is “{text_label}“.” “Please generate a prompt template that can be used to create images in the same style as this artistic tex…
Figure 3: Visualization of synthesized multilingual examples. … a 1M-scale Chinese WATER-S by replacing the English corpus with a Chinese one, and select 101 Chinese WordArt samples from BCTR-Test [4] as a small evaluation set (visualized in …
Figure 4: Visualization of the Chinese WordArt test set selected from BCTR-Test [4]. A.6 Visualization of WATER-S
Figure 5: More visualizations of the WATER-T dataset
Figure 6: More visualizations of the WATER-Z dataset. B More Model Details. B.1 Position Embedding. APE Implementation: Each image is first partitioned into non-overlapping patches and flattened into a token sequence $\mathbf{x}_n \in \mathbb{R}^{d}$. For each patch we also compute its 2D grid coordinate $\mathbf{p}_n=(p_n^x,p_n^y)$. We parameterize absolute position as two learnable lookup tables, one for the height …
Figure 7: More bad-case examples from our final model (WATERec / RS), and the information below each image is formatted as: label | prediction | confidence
Abstract

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, with the scale improved by hundreds of times compared to existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to advance WordArt-oriented scene text recognition (WATER) by constructing a 2M-image synthetic dataset WATER-S via an upgraded SynthWordArt pipeline and Qwen3-VL/Z-Image AI synthesis, reorganizing existing real data into WATER-R, and introducing the WATERec architecture with an arbitrary-shaped visual encoder plus autoregressive decoder. Experiments are reported to yield 90.40% accuracy on WordArt-Bench, outperforming both general-purpose and OCR-specialized vision-language models, with public code and data release.

Significance. If the empirical results hold, the work provides a substantial contribution to scene text recognition by scaling artistic-text data by hundreds of times and structurally departing from fixed-template STR architectures. The public release of code and data is an explicit strength supporting reproducibility and future benchmarking.

major comments (2)
  1. [Abstract] Abstract: The central performance claim of 90.40% accuracy and SOTA status is asserted without any specification of the evaluation metric (word accuracy, character accuracy, etc.), the exact baselines compared, data splits used, or controls for the synthetic-to-real domain gap. This information is load-bearing for assessing whether the reported margin over prior methods is robust.
  2. [Experiments] Experiments section: No error bars, multiple runs, or statistical significance tests are referenced for the 90.40% result or the architectural improvements, undermining confidence in the claim that WATERec plus the new data reliably surpasses existing STR and VLM approaches.
minor comments (2)
  1. [Abstract] The abstract introduces 'WordArt-Bench' without a one-sentence definition or pointer to its construction, which would aid readers unfamiliar with the benchmark.
  2. Notation for the two subsets of WATER-S (SynthWordArt-rendered vs. AI-synthesized) could be introduced more explicitly when first mentioned to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarity and experimental robustness. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim of 90.40% accuracy and SOTA status is asserted without any specification of the evaluation metric (word accuracy, character accuracy, etc.), the exact baselines compared, data splits used, or controls for the synthetic-to-real domain gap. This information is load-bearing for assessing whether the reported margin over prior methods is robust.

    Authors: We agree the abstract requires greater precision. The reported 90.40% is word-level accuracy on the WordArt-Bench test split following the standard STR evaluation protocol (exact match after normalization). Baselines encompass prior STR models (e.g., ABINet, PARSeq) as well as general-purpose and OCR-specialized VLMs. Data splits match the public WordArt-Bench definitions. For the domain gap, the “R” versus “RS” comparison in Figure 1 separates models trained on existing real data only from models that also use synthetic WATER-S, and every configuration is evaluated on the real images in WordArt-Bench. We will revise the abstract to state these details explicitly. revision: yes

  2. Referee: [Experiments] Experiments section: No error bars, multiple runs, or statistical significance tests are referenced for the 90.40% result or the architectural improvements, undermining confidence in the claim that WATERec plus the new data reliably surpasses existing STR and VLM approaches.

    Authors: We acknowledge that error bars and multi-run statistics would increase confidence. All results derive from single training runs, consistent with common practice for large-scale (2M-image) experiments in the STR literature. We will add an explicit limitations paragraph in the experiments section noting the single-run nature and computational rationale. We cannot supply new multi-seed results or significance tests without substantial additional compute. revision: partial
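
The metric and the uncertainty question raised in the exchange above can be illustrated with a small sketch. It computes word-level accuracy as exact match after normalization, then estimates a percentile bootstrap interval by resampling test samples. The normalization rule (lowercasing and keeping only ASCII letters and digits) is an assumption modeled on common STR practice rather than the paper's documented protocol, and the function names and toy data are hypothetical. A bootstrap over test samples captures evaluation-set sampling noise only; it does not replace results from multiple training seeds.

```python
import random
import re
from typing import List, Sequence, Tuple

_DROP = re.compile(r"[^0-9a-z]")


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, then drop everything except a-z and 0-9."""
    return _DROP.sub("", text.lower())


def matches(labels: Sequence[str], preds: Sequence[str]) -> List[int]:
    """1 where the normalized prediction equals the normalized label, else 0."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    return [int(normalize(lab) == normalize(pred)) for lab, pred in zip(labels, preds)]


def word_accuracy(labels: Sequence[str], preds: Sequence[str]) -> float:
    """Fraction of samples whose whole word is recognized exactly."""
    hits = matches(labels, preds)
    return sum(hits) / len(hits) if hits else 0.0


def bootstrap_interval(hits: Sequence[int], n_resamples: int = 1000,
                       alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy, resampling test samples."""
    if not hits:
        raise ValueError("hits must not be empty")
    rng = random.Random(seed)
    pool = list(hits)
    n = len(pool)
    accs = sorted(sum(rng.choices(pool, k=n)) / n for _ in range(n_resamples))
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    # Toy labels and predictions, invented purely for illustration.
    labels = ["Coffee", "SALE!", "Open", "Bakery"]
    preds = ["coffee", "sale", "0pen", "bakery"]
    print("word accuracy:", word_accuracy(labels, preds))
    print("bootstrap interval:", bootstrap_interval(matches(labels, preds)))
```

With only four toy samples the interval is very wide; on a benchmark-sized test set the same procedure gives a rough sense of how much of a reported margin could be explained by test-set sampling alone.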

Circularity Check

0 steps flagged

No significant circularity; purely empirical contribution

full rationale

The paper advances WATER via dataset construction (2M synthetic WATER-S from upgraded SynthWordArt pipeline plus Qwen3-VL/Z-Image synthesis) and a new architecture (WATERec: arbitrary-shape visual encoder + autoregressive decoder), plus reorganized real data WATER-R. The central claim is the empirical result of 90.40% accuracy on WordArt-Bench. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. All steps are data/model engineering evaluated on external benchmarks; the chain is self-contained and falsifiable outside any internal fit.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the domain assumption that large-scale synthetic data plus flexible architecture overcomes fixed-template limitations; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • Model training hyperparameters
    Standard deep learning parameters (learning rate, batch size, etc.) tuned on held-out data; not enumerated in abstract but implicit in any neural training.
axioms (2)
  • domain assumption: Synthetic data from upgraded rendering and VLM-guided synthesis approximates real WordArt distributions sufficiently for model training and evaluation
    Invoked in the data construction and performance claims sections of the abstract.
  • domain assumption: Autoregressive decoding can effectively capture complex text layouts where fixed-template methods fail
    Basis for proposing the WATERec decoder architecture.



Reference graph

Works this paper leans on

57 extracted references · 2 canonical work pages

  1. [1]

    CoRR abs/2511.21631 (2025)

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  2. [2]

    In: ECCV

    Bautista, D., Atienza, R.: Scene text recognition with permuted autoregressive sequence models. In: ECCV. pp. 178–196 (2022)

  3. [3]

    CoRR abs/2511.22699 (2025)

    Cai, H., Cao, S., Du, R., Gao, P., Hoi, S., Hou, Z., Huang, S., Jiang, D., Jin, X., Li, L., et al.: Z-image: An efficient image generation foundation model with single-stream diffusion transformer. CoRR abs/2511.22699 (2025)

  4. [4]

    CoRR abs/2112.15093 (2021)

    Chen, J., Yu, H., Ma, J., Guan, M., Xu, X., Wang, X., Qu, S., Li, B., Xue, X.: Benchmarking chinese text recognition: Datasets, baselines, and an empirical study. CoRR abs/2112.15093 (2021)

  5. [5]

    In: CVPR

    Cui, C., Sun, T., Liang, S., Gao, T., Zhang, Z., Liu, J., Wang, X., Zhou, C., Liu, H., Lin, M., et al.: Boosting document parsing efficiency and performance with coarse-to-fine visual processing. In: CVPR. pp. 16655–16665 (2026)

  6. [6]

    CoRR abs/2507.05595 (2025)

    Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: Paddleocr 3.0 technical report. CoRR abs/2507.05595 (2025)

  7. [7]

    NeurIPS 36, 2252–2274 (2023)

    Dehghani, M., Mustafa, B., Djolonga, J., Heek, J., Minderer, M., Caron, M., Steiner, A., Puigcerver, J., Geirhos, R., Alabdulmohsin, I.M., et al.: Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. NeurIPS 36, 2252–2274 (2023)

  8. [8]

    CoRR abs/2511.03929 (2025)

    Deshmukh, A.S., Chumachenko, K., Rintamaki, T., Le, M., Poon, T., Taheri, D.M., Karmanov, I., Liu, G., Seppanen, J., Chen, G., et al.: Nvidia nemotron nano v2 vl. CoRR abs/2511.03929 (2025)

  9. [9]

    In: ICLR (2021)

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)

  10. [10]

    In: AAAI

    Du, Y., Chen, Z., Jia, C., Gao, X., Jiang, Y.G.: Out of length text recognition with sub-string matching. In: AAAI. vol. 39, pp. 2798–2806 (2025)

  11. [11]

    IEEE Trans

    Du, Y., Chen, Z., Jia, C., Yin, X., Li, C., Du, Y., Jiang, Y.G.: Context perception parallel decoder for scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 47(6), 4668–4683 (2025). https://doi.org/10.1109/TPAMI.2025.3545453

  12. [12]

    In: IJCAI

    Du, Y., Chen, Z., Jia, C., Yin, X., Zheng, T., Li, C., Du, Y., Jiang, Y.G.: SVTR: Scene text recognition with a single visual model. In: IJCAI. pp. 884–890 (2022)

  13. [13]

    In: ICCV

    Du, Y., Chen, Z., Xie, H., Jia, C., Jiang, Y.G.: Svtrv2: Ctc beats encoder-decoder models in scene text recognition. In: ICCV. pp. 20147–20156 (2025)

  14. [14]

    IEEE Trans

    Du, Y., Chen, Z., Yuchen, S., Jia, C., Jiang, Y.G.: Instruction-guided scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 47(4), 2723–2738 (2025)

  15. [15]

    In: AAAI

    Du, Y., Zhao, M., Fan, S., Chen, Z., Jia, C., Jiang, Y.G.: Mdiff4str: Mask diffusion model for scene text recognition. In: AAAI. vol. 40, pp. 3705–3713 (2026)

  16. [16]

    In: CVPR

    Fang, S., Xie, H., Wang, Y., Mao, Z., Zhang, Y.: Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In: CVPR. pp. 7098–7107 (2021)

  17. [17]

    In: ICML

    Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: ICML. pp. 369–376 (2006)

  18. [18]

    IEEE Trans

    Guan, T., Shen, W., Yang, X.: Ccdplus: Towards accurate character to character distillation for text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 47(5), 3546–3562 (2025). https://doi.org/10.1109/TPAMI.2025.3533737

  19. [19]

    In: ICCV

    Guan, T., Shen, W., Yang, X., Feng, Q., Jiang, Z., Yang, X.: Self-supervised character-to-character distillation for text recognition. In: ICCV. pp. 19473–19484 (2023)

  20. [20]

    In: CVPR

    Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: CVPR. pp. 2315–2324 (2016)

  21. [21]

    In: ECCV

    Heo, B., Park, S., Han, D., Yun, S.: Rotary position embedding for vision transformer. In: ECCV. pp. 289–305 (2024)

  22. [22]

    In: ICLR (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)

  23. [23]

    CoRR abs/1406.2227 (2014)

    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227 (2014)

  24. [24]

    In: ICCV

    Jiang, Q., Wang, J., Peng, D., Liu, C., Jin, L.: Revisiting scene text recognition: A data perspective. In: ICCV. pp. 20486–20497 (2023)

  25. [25]

    In: ICDAR

    Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: ICDAR 2015 competition on robust reading. In: ICDAR. pp. 1156–1160 (2015)

  26. [26]

    In: ICDAR

    Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: ICDAR 2013 robust reading competition. In: ICDAR. pp. 1484–1493 (2013)

  27. [27]

    CoRR abs/2506.15742 (2025)

    Labs, B.F., Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., Lacey, K., Levi, Y., Li, C., Lorenz, D., Müller, J., Podell, D., Rombach, R., Saini, H., Sauer, A., Smith, L.: Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. CoRR abs/2506.15742 (2025)

  28. [28]

    In: AAAI

    Li, H., Wang, P., Shen, C., Zhang, G.: Show, attend and read: A simple and strong baseline for irregular text recognition. In: AAAI. vol. 33, pp. 8610–8617 (2019)

  29. [29]

    In: AAAI

    Li, M., Lv, T., Chen, J., Cui, L., Lu, Y., Florencio, D., Zhang, C., Li, Z., Wei, F.: Trocr: Transformer-based optical character recognition with pre-trained models. In: AAAI. vol. 37, pp. 13094–13102 (2023)

  30. [30]

    In: ICLR (2019)

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)

  31. [31]

    In: BMVC

    Mishra, A., Alahari, K., Jawahar, C.: Scene text recognition using higher order language priors. In: BMVC. pp. 1–11 (2012)

  32. [32]

    In: ICCV

    Phan, T.Q., Shivakumara, P., Tian, S., Tan, C.L.: Recognizing text with perspective distortion in natural scenes. In: ICCV. pp. 569–576 (2013)

  33. [33]

    In: ACM MM

    Qiao, Z., Zhou, Y., Wei, J., Wang, W., Zhang, Y., Jiang, N., Wang, H., Wang, W.: Pimnet: a parallel, iterative and mimicking network for scene text recognition. In: ACM MM. pp. 2046–2055 (2021)

  34. [34]

    ESWA 41(18), 8027–8048 (2014)

    Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary text detection system for natural scene images. ESWA 41(18), 8027–8048 (2014)

  35. [35]

    In: ICDAR

    Sheng, F., Chen, Z., Xu, B.: NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In: ICDAR. pp. 781–786 (2019)

  36. [36]

    IEEE Trans

    Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2016). https://doi.org/10.1109/TPAMI.2016.2646371

  37. [37]

    Neurocomputing 568, 127063 (2024)

    Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024)

  38. [38]

    CoRR abs/2511.19575 (2025)

    Team, H.V., Lyu, P., Wan, X., Li, G., Peng, S., Wang, W., Wu, L., Shen, H., Zhou, Y., Tang, C., et al.: Hunyuanocr technical report. CoRR abs/2511.19575 (2025)

  39. [39]

    In: ICCV

    Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV. pp. 1457–1464 (2011)

  40. [40]

    CoRR abs/2508.18265 (2025)

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. CoRR abs/2508.18265 (2025)

  41. [41]

    CoRR abs/2409.01704 (2024)

    Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., Ge, Z., Zhao, L., Sun, J., Peng, Y., et al.: General ocr theory: Towards ocr-2.0 via a unified end-to-end model. CoRR abs/2409.01704 (2024)

  42. [42]

    CoRR abs/2510.18234 (2025)

    Wei, H., Sun, Y., Li, Y.: Deepseek-ocr: Contexts optical compression. CoRR abs/2510.18234 (2025)

  43. [43]

    CoRR abs/2601.20552 (2026)

    Wei, H., Sun, Y., Li, Y.: Deepseek-ocr 2: Visual causal flow. CoRR abs/2601.20552 (2026)

  44. [44]

    In: AAAI

    Wei, J., Zhan, H., Lu, Y., Tu, X., Yin, B., Liu, C., Pal, U.: Image as a language: Revisiting scene text recognition via balanced, unified and synchronized vision-language reasoning network. In: AAAI. vol. 38, pp. 5885–5893 (2024)

  45. [45]

    CoRR abs/2508.02324 (2025)

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. CoRR abs/2508.02324 (2025)

  46. [46]

    In: ICDAR

    Xie, X., Deng, L., Zhang, Z., Wang, Z., Liu, Y.: Icdar 2024 competition on artistic text recognition. In: ICDAR. pp. 301–314 (2024)

  47. [47]

    In: ECCV

    Xie, X., Fu, L., Zhang, Z., Wang, Z., Bai, X.: Toward understanding wordart: Corner-guided transformer for scene text recognition. In: ECCV. pp. 303–321 (2022)

  48. [48]

    In: ECCV

    Xie, X., Li, Y., Liu, Y., Zhang, Z., Wang, Z., Xiong, W., Bai, X.: Was: Dataset and methods for artistic text segmentation. In: ECCV. pp. 237–254 (2024)

  49. [49]

    In: CVPR

    Xu, J., Wang, Y., Xie, H., Zhang, Y.: Ote: Exploring accurate scene text recognition using one token. In: CVPR. pp. 28327–28336 (2024)

  50. [50]

    In: CVPR

    Xu, X., Zhang, Z., Wang, Z., Price, B., Wang, Z., Shi, H.: Rethinking text segmentation: A novel dataset and a text-specific refinement approach. In: CVPR. pp. 12045–12055 (2021)

  51. [51]

    In: ICCV

    Ye, X., Du, Y., Tao, Y., Chen, Z.: Textssr: Diffusion-based data synthesis for scene text recognition. In: ICCV. pp. 17464–17473 (2025)

  52. [52]

    In: CVPR

    Ye, X., Du, Y., Zhang, J., Li, C., LYU, J., Chen, Z.: What’s wrong with synthetic data for scene text recognition? a strong synthetic engine with diverse simulations and self-evolution. In: CVPR. pp. 16645–16654 (2026)

  53. [53]

    In: ICDAR

    Yim, M., Kim, Y., Cho, H.C., Park, S.: Synthtiger: Synthetic text image generator towards better text recognition models. In: ICDAR. pp. 109–124 (2021)

  54. [54]

    In: CVPR

    Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., Ding, E.: Towards accurate scene text recognition with semantic reasoning networks. In: CVPR. pp. 12113–12122 (2020)

  55. [55]

    In: CCPR

    Zhai, C., Chen, Z., Li, J., Xu, B.: Chinese image text recognition with blstm-ctc: a segmentation-free method. In: CCPR. pp. 525–536 (2016)

  56. [56]

    In: IJCAI

    Zhang, B., Xie, H., Wang, Y., Xu, J., Zhang, Y.: Linguistic more: Taking a further step toward efficient and accurate scene text recognition. In: IJCAI. pp. 1704–1712 (2023)

  57. [57]

    In: ECCV

    Zhu, Y., Liu, J., Gao, F., Liu, W., Wang, X., Wang, P., Huang, F., Yao, C., Yang, Z.: Visual text generation in the wild. In: ECCV. pp. 89–106 (2024)