
arxiv: 2605.18173 · v1 · pith:EEOR7C6Mnew · submitted 2026-05-18 · 💻 cs.CV

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

Pith reviewed 2026-05-20 11:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords scene text spotting · text detection · text recognition · attention mechanism · transformer encoder · mask embedding · end-to-end framework · arbitrary shape text

The pith

Soft attention weights from transformers refine text masks to enable accurate spotting without any rectification step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a Soft Attention Mask Embedding module that processes high-level features through a Transformer encoder to produce soft attention weights. These weights are then combined hierarchically with the initial mask predictions to create cleaner text-boundary masks that suppress background interference. The resulting SAME-Net framework performs end-to-end detection and recognition on curved or rotated text while skipping both character-level labels and any separate rectification network. Gradients from the recognition loss flow back through the module to improve the detection branch during joint training. The paper reports 84.02% end-to-end H-mean on Total-Text, 1.02% above GLASS in full-lexicon accuracy, and a competitive 83.4% strong-lexicon result on ICDAR 2015.

Core claim

By computing soft attention weights from Transformer-encoded high-level features and hierarchically embedding them with predicted masks, the SAME module produces refined text-boundary-aware masks that suppress background noise, allowing a single network to perform robust end-to-end scene text spotting without character-level annotations or auxiliary rectification modules.

What carries the argument

The Soft Attention Mask Embedding (SAME) module, which uses Transformer encoders to generate soft attention weights and embeds them hierarchically with mask predictions to refine text boundaries and reduce noise.
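The refinement idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names (`self_attention`, `soft_weights`, `embed`), the identity query/key/value projections, the sigmoid read-out with a hand-picked `proj` vector, and the elementwise product used as the embedding step are all assumptions made for the example, since the abstract does not specify them.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (an assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Attention-weighted average of all tokens (global receptive field).
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def soft_weights(tokens, proj, bias=0.0):
    """Map each attended token to a soft weight in (0, 1) via a sigmoid."""
    attended = self_attention(tokens)
    return [
        1.0 / (1.0 + math.exp(-(sum(p * x for p, x in zip(proj, t)) + bias)))
        for t in attended
    ]

def embed(mask, weights_per_stage):
    """Fold each stage's soft weights into the mask by elementwise product."""
    refined = list(mask)
    for weights in weights_per_stage:
        refined = [m * w for m, w in zip(refined, weights)]
    return refined

# Toy input: four flattened locations with 2-D features.
# The first two resemble "text", the last two resemble "background".
features = [[2.0, 0.0], [2.0, 0.1], [0.0, 2.0], [0.1, 2.0]]
coarse_mask = [0.9, 0.8, 0.6, 0.5]            # noisy initial mask prediction
w = soft_weights(features, proj=[1.0, -1.0])  # hypothetical projection vector
refined_mask = embed(coarse_mask, [w, w])     # two refinement stages
```

In this toy setting the background locations receive weights below 0.5, so repeated embedding pushes their mask values down much faster than those of the text locations. A plain product also shrinks the text values somewhat, so a real module would presumably rely on a learned fusion rather than this bare multiplication.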

If this is right

  • Joint training of detection and recognition becomes possible because the module is fully differentiable and passes recognition gradients to the detection branch.
  • No character-level annotations or separate rectification modules are required while still handling arbitrarily shaped and multi-oriented text.
  • Accuracy gains appear on curved-text benchmarks without using extra training data beyond standard sets.
  • The same pipeline delivers competitive results on multi-oriented text datasets while removing the rectification component.
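The differentiability point in the first bullet can be made concrete with a toy comparison. The sketch below is an illustration under stated assumptions, not the paper's training code: `recognition_loss` is a hypothetical stand-in (a squared error on one masked feature), and the gradient is estimated by central finite differences rather than by an autodiff framework. It contrasts a soft sigmoid mask with a hard thresholded mask.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recognition_loss(mask_logit, feature, target, hard):
    """Squared error on a masked feature; a stand-in for a recognition loss."""
    if hard:
        gate = 1.0 if mask_logit > 0.0 else 0.0   # thresholded (binary) mask
    else:
        gate = sigmoid(mask_logit)                # soft mask in (0, 1)
    return (gate * feature - target) ** 2

def grad_wrt_logit(mask_logit, feature, target, hard, eps=1e-4):
    """Central finite difference of the loss with respect to the mask logit."""
    up = recognition_loss(mask_logit + eps, feature, target, hard)
    down = recognition_loss(mask_logit - eps, feature, target, hard)
    return (up - down) / (2.0 * eps)

# Same logit, feature, and target; only the masking rule differs.
soft_grad = grad_wrt_logit(0.5, feature=2.0, target=2.0, hard=False)
hard_grad = grad_wrt_logit(0.5, feature=2.0, target=2.0, hard=True)
```

With the soft mask the loss gradient with respect to the mask logit is non-zero, so a recognition error can adjust the detection-side prediction. With the hard threshold the gradient is zero away from the threshold, so that signal is lost.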

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-based refinement idea could be tested on other tasks involving irregular shapes, such as segmenting curved structures in medical scans or aerial imagery.
  • Removing explicit rectification may lower overall model complexity and inference time for real-time mobile text spotting applications.
  • If the mechanism proves robust across domains, it could reduce reliance on geometric transformations in broader irregular-object recognition pipelines.

Load-bearing premise

Soft attention weights derived from high-level Transformer features can be embedded with masks to create boundary-aware refinements that reliably separate arbitrary-shaped text from complex backgrounds.

What would settle it

Measure end-to-end accuracy of SAME-Net against an otherwise identical network that adds an explicit rectification branch on a dataset containing extreme perspective warps and heavy background clutter; if the rectification version wins by a clear margin, the claim weakens.

Original abstract

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.
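For readers unfamiliar with the metric, H-mean is the harmonic mean of precision and recall. The sketch below shows the arithmetic only; the helper names and the counts are made up for illustration and are not taken from the paper. In common end-to-end spotting protocols a prediction counts as a true positive only when it matches a ground-truth region and its transcription is correct, but the matching rule itself is outside this example.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def end_to_end_scores(true_positives, num_predictions, num_ground_truth):
    """Precision, recall, and H-mean from raw match counts."""
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, h_mean(precision, recall)

# Made-up counts, chosen only to exercise the functions.
p, r, h = end_to_end_scores(true_positives=8, num_predictions=10, num_ground_truth=16)
```

Because the harmonic mean is dominated by the smaller of the two inputs, a spotter cannot reach a high H-mean by trading away recall for precision or the reverse.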

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Soft Attention Mask Embedding (SAME) module for rectification-free end-to-end scene text spotting. SAME uses Transformer encoders on high-level features to compute soft attention weights that are hierarchically embedded with predicted masks, yielding refined text-boundary-aware masks to suppress background noise for arbitrary shapes. The resulting SAME-Net framework requires no character-level annotations or auxiliary rectification, supports joint optimization of detection and recognition via back-propagation, and reports 84.02% end-to-end H-mean on Total-Text (1.02% above GLASS) plus competitive 83.4% strong-lexicon results on ICDAR 2015.

Significance. If the empirical claims hold under rigorous verification, the work would be significant for simplifying scene text spotting pipelines by eliminating explicit rectification while improving performance on challenging arbitrary-shape benchmarks. The fully differentiable soft attention design enabling joint detection-recognition optimization is a clear strength, as is the reported gain without additional training data. These elements could influence future architectures if the resolution and reproducibility concerns are resolved.

major comments (2)
  1. §3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.
  2. §4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.
minor comments (1)
  1. Notation for the hierarchical embedding step could be clarified with a diagram or pseudocode to improve readability of the mask refinement process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, providing explanations and indicating revisions made where appropriate.

Point-by-point responses
  1. Referee: §3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.

    Authors: We appreciate the referee's emphasis on this foundational aspect of the SAME module. The Transformer encoder is applied to high-level features precisely to leverage its global receptive field for capturing long-range context, which informs the computation of soft attention weights. These weights are then hierarchically embedded with the predicted masks across multiple stages, enabling progressive boundary refinement and background suppression without requiring character-level annotations or explicit rectification. This design choice allows the global context to compensate for resolution loss from downsampling, as the attention mechanism adaptively focuses on text-relevant regions amid multi-scale and arbitrary-shape variations. To address the concern directly, we have revised §3 to include an expanded explanation of the hierarchical embedding process, a new figure illustrating the multi-stage refinement, and an ablation study isolating the Transformer's contribution. While we maintain that the current architecture suffices for the reported gains on Total-Text (as the rectification-free pipeline achieves state-of-the-art results), we have added a note in the discussion acknowledging that explicit multi-scale fusion could be explored as future work.

    revision: partial

  2. Referee: §4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.

    Authors: We agree that additional experimental details are critical for reproducibility and to substantiate the role of the SAME module in driving the observed improvements. In the revised manuscript, we have substantially expanded §4 with the following: explicit descriptions of the standard data splits and preprocessing for Total-Text and ICDAR 2015; comprehensive ablation studies (including tables) on key components such as the Transformer encoder, soft attention weights, and hierarchical embedding, demonstrating their individual and combined contributions; results reported as mean with standard deviation over three independent runs to provide error bars; and further clarification on the back-propagation path enabling joint detection-recognition optimization. These revisions confirm that the 84.02% H-mean and the 1.02% gain over GLASS are attributable to the proposed module rather than post-hoc decisions, thereby strengthening confidence in the joint-optimization benefit.

    revision: yes
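The "mean with standard deviation over three independent runs" mentioned in the second response can be computed as sketched below. This is an illustration only: the helper names are hypothetical, the scores are synthetic placeholders that do not come from the paper, and the two-sigma check is a crude heuristic rather than a proper significance test.

```python
import math
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

def gain_exceeds_noise(baseline_runs, variant_runs, k=2.0):
    """Heuristic: is the mean gain larger than k times the combined spread?"""
    base_mean, base_std = summarize_runs(baseline_runs)
    var_mean, var_std = summarize_runs(variant_runs)
    combined = math.sqrt(base_std ** 2 + var_std ** 2)
    return (var_mean - base_mean) > k * combined

# Synthetic placeholder scores, NOT results from the paper.
baseline = [0.50, 0.52, 0.54]
variant = [0.60, 0.61, 0.62]

base_mean, base_std = summarize_runs(baseline)
clear_gain = gain_exceeds_noise(baseline, variant)
```

A check of this kind is what lets a reader judge whether a gain of about one point is larger than the run-to-run variation.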

Circularity Check

0 steps flagged

No circularity: architecture proposal validated empirically on external benchmarks

Full rationale

The paper proposes the SAME module as a differentiable architectural component that computes soft attention from Transformer-encoded features and embeds it hierarchically with mask predictions to refine boundaries. This is presented as an engineering design choice, not a derivation that reduces to its own fitted parameters or prior self-citations. The central performance claims (84.02% H-mean on Total-Text, +1.02% over GLASS) are reported as outcomes of end-to-end training and evaluation on standard public datasets, with no equations or uniqueness theorems shown that would make the reported gains tautological by construction. The module is fully differentiable by design, allowing joint optimization, but this does not create a self-definitional loop. No load-bearing self-citation chains or renamed empirical patterns are evident in the provided description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach relies on standard transformer attention and differentiability assumptions plus the new SAME module; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Soft attention weights from Transformer encoders can be hierarchically embedded with masks to refine boundaries.
    Invoked in the description of the SAME module to suppress background noise without rectification.
invented entities (1)
  • Soft Attention Mask Embedding (SAME) module (no independent evidence)
    purpose: Generate refined text-boundary-aware masks from rough proposals using transformer attention.
    New component introduced to enable rectification-free spotting; no independent evidence outside the paper is provided.


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 4 internal anchors

  1. [1]

    A survey on table recognition technology,

    L. Gao, Y . Li, L. Du, X. Zhang, Z. Zhu, N. Lu, L. Jin, Y . Huang, and Z. Tang, “A survey on table recognition technology,”Journal of Image and Graphics, vol. 27, no. 6, pp. 1898–1917, 2022

  2. [2]

    Deep learning methods for scene text detection and recognition,

    C. Liu, X. Chen, C. Luo, L. Jin, Y . Xue, and Y . Liu, “Deep learning methods for scene text detection and recognition,” Journal of Image and Graphics, vol. 26, no. 6, pp. 1330– 1367, 2021

  3. [3]

    TextSquare: Scaling up text-centric visual instruction tuning,

    J. Tang, C. Lin, Z. Zhao, S. Wei, B. Wu, Q. Liu, H. Feng, Y . Li, S. Wang, L. Liaoet al., “TextSquare: Scaling up text-centric visual instruction tuning,”arXiv preprint arXiv:2404.12803, 2024

  4. [4]

    An end-to-end trainable neural network for image-based sequence recognition and its ap- plication to scene text recognition,

    B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its ap- plication to scene text recognition,” vol. 39, no. 11. IEEE, 2017, pp. 2298–2304

  5. [5]

    Charac- ter region awareness for text detection,

    Y . Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Charac- ter region awareness for text detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR). IEEE, 2019, pp. 9357–9366

  6. [6]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  7. [7]

    Denoising diffusion proba- bilistic models,

    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion proba- bilistic models,”Advances in Neural Information Process- ing Systems, vol. 33, pp. 6840–6851, 2020

  8. [8]

    High-resolution image synthesis with la- tent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with la- tent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695

  9. [9]

    SETrans- former: A hybrid attention-based architecture for robust human activity recognition,

    Y . Liu, X. Qin, Y . Gao, X. Li, and C. Feng, “SETrans- former: A hybrid attention-based architecture for robust human activity recognition,”INNO-PRESS: Journal of Emerging Applied AI, vol. 1, no. 1, 2025

  10. [10]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017

  11. [11]

    End-to-end object detection with transformers,

    N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kir- illov, and S. Zagoruyko, “End-to-end object detection with transformers,” inProceedings of the European Conference on Computer Vision (ECCV). Springer, 2020, pp. 213– 229

  12. [12]

    End-to-end scene text recognition,

    K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” inProceedings of the International Con- ference on Computer Vision (ICCV). IEEE, 2011, pp. 1457–1464

  13. [13]

    Pho- toocr: Reading text in uncontrolled conditions,

    A. Bissacco, M. Cummins, Y . Netzer, and H. Neven, “Pho- toocr: Reading text in uncontrolled conditions,” inPro- ceedings of the IEEE International Conference on Com- puter Vision (ICCV). IEEE, 2013, pp. 785–792

  14. [14]

    Textboxes: A fast text detector with a single deep neural network,

    M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, “Textboxes: A fast text detector with a single deep neural network,” inProceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI Press, 2017, pp. 4161–4167

  15. [15]

    Towards end-to-end text spotting with convolutional recurrent neural networks,

    H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting with convolutional recurrent neural networks,” inProceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5248–5256

  16. [16]

    Fots: Fast oriented text spotting with a unified network,

    X. Liu, D. Liang, S. Yan, D. Chen, Y . Qiao, and J. Yan, “Fots: Fast oriented text spotting with a unified network,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 5676–5685

  17. [17]

    To- wards unconstrained end-to-end text spotting,

    S. Qin, A. Bissaco, M. Raptis, Y . Fujii, and Y . Xiao, “To- wards unconstrained end-to-end text spotting,” inProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV). IEEE, 2019, pp. 4703–4713

  18. [18]

    Dol- phin: Document image parsing via heterogeneous anchor prompting,

    H. Feng, S. Wei, X. Fei, W. Shi, Y . Han, L. Liao, J. Lu, B. Wu, Q. Liu, C. Lin, J. Tang, and C. Huang, “Dol- phin: Document image parsing via heterogeneous anchor prompting,” inFindings of the Association for Computa- tional Linguistics: ACL 2025, 2025, pp. 21 919–21 936

  19. [19]

    UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding,

    H. Feng, Z. Wang, J. Tang, J. Lu, W. Zhou, H. Li, and C. Huang, “UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding,”arXiv preprint arXiv:2308.11592, 2023

  20. [20]

    Swintextspotter: Scene text spotting via better synergy between text detection and text recog- nition,

    M. Huang, Y . Liu, Z. Peng, C. Liu, D. Lin, S. Zhu, N. Yuan, K. Ding, and L. Jin, “Swintextspotter: Scene text spotting via better synergy between text detection and text recog- nition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 4583–4593

  21. [21]

    Text spotting transformers,

    X. Zhang, Y . Su, S. Tripathi, and Z. Tu, “Text spotting transformers,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 9509–9518

  22. [22]

    Deepsolo: Let transformer decoder with explicit points solo for text spotting,

    M. Ye, J. Zhang, S. Zhao, J. Liu, T. Liu, B. Du, and D. Tao, “Deepsolo: Let transformer decoder with explicit points solo for text spotting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023, pp. 19 348–19 357

  23. [23]

    You Only Look Once: Unified, Real-Time Object Detection

    J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,”arXiv preprint arXiv:1506.02640, 2016

  24. [24]

    Trends and prospects of techniques for haze removal from degraded images: A survey,

    G. Sahu, A. Seal, D. Bhattacharjee, M. Nasipuri, P. Brida, and O. Krejcar, “Trends and prospects of techniques for haze removal from degraded images: A survey,”IEEE Transactions on Emerging Topics in Computational Intel- ligence, vol. 6, no. 4, pp. 762–782, 2022

  25. [25]

    Mask textspotter: An end-to-end trainable neural network for 8 spotting text with arbitrary shapes,

    P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for 8 spotting text with arbitrary shapes,” inProceedings of the 15th European Conference on Computer Vision (ECCV). Springer, 2018, pp. 71–88

  26. [26]

    Mask textspotter v3: Segmentation proposal network for robust scene text spotting,

    M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask textspotter v3: Segmentation proposal network for robust scene text spotting,” inProceedings of the 16th European Conference on Computer Vision (ECCV). Springer, 2020, pp. 706–722

  27. [27]

    Textsnake: A flexible representation for detecting text of arbitrary shapes,

    S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “Textsnake: A flexible representation for detecting text of arbitrary shapes,” inProceedings of the European Confer- ence on Computer Vision (ECCV). Springer, 2018, pp. 19–35

  28. [28]

    Few could be better than all: Feature sampling and grouping for scene text detection,

    J. Tang, W. Zhang, H. Liu, M. Yang, B. Jiang, G. Hu, and X. Bai, “Few could be better than all: Feature sampling and grouping for scene text detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2022, pp. 4563–4572

  29. [29]

    You can even annotate text with voice: Transcription- only-supervised text spotting,

    J. Tang, S. Qiao, B. Cui, Y . Ma, S. Zhang, and D. Kanoulas, “You can even annotate text with voice: Transcription- only-supervised text spotting,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4154–4163

  30. [30]

    Arts: Eliminating inconsistency between text detection and recognition with auto-rectification text spotter,

    H. Zhong, J. Tang, W. Wang, Z. Yang, C. Yao, and T. Lu, “Arts: Eliminating inconsistency between text detection and recognition with auto-rectification text spotter,”arXiv preprint arXiv:2110.10405, 2021

  31. [31]

    Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,

    Y . Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8048–8064, 2022

  32. [32]

    Docpedia: Unleashing the power of large mul- timodal model in the frequency domain for versatile docu- ment understanding,

    H. Feng, Q. Liu, H. Liu, J. Tang, W. Zhou, H. Li, and C. Huang, “Docpedia: Unleashing the power of large mul- timodal model in the frequency domain for versatile docu- ment understanding,”Science China Information Sciences, vol. 67, no. 12, pp. 1–14, 2024

  33. [33]

    Multi-modal in-context learning makes an ego-evolving scene text recognizer,

    Z. Zhao, J. Tang, C. Lin, B. Wu, C. Huang, H. Liu, X. Tan, Z. Zhang, and Y . Xie, “Multi-modal in-context learning makes an ego-evolving scene text recognizer,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 567–15 576

  34. [34]

    Harmonizing visual text comprehension and generation,

    Z. Zhao, J. Tang, B. Wu, C. Lin, S. Wei, H. Liu, X. Tan, Z. Zhang, C. Huang, and Y . Xie, “Harmonizing visual text comprehension and generation,”arXiv preprint arXiv:2407.16364, 2024

  35. [35]

    Estextspotter: Towards better scene text spotting with explicit synergy in transformer,

    M. Huang, J. Zhang, D. Peng, H. Lu, C. Huang, Y . Liu, X. Bai, and L. Jin, “Estextspotter: Towards better scene text spotting with explicit synergy in transformer,” inPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, pp. 19 495–19 505

  36. [36]

    Feature pyramid networks for object detec- tion,

    T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detec- tion,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 936–944

  37. [37]

    Mtvqa: Benchmarking multilingual text-centric visual question answering

    J. Tang, Q. Liu, Y . Ye, J. Lu, S. Wei, C. Lin, W. Li, M. F. F. B. Mahmood, H. Feng, Z. Zhaoet al., “MTVQA: Benchmarking multilingual text-centric visual question answering,”arXiv preprint arXiv:2405.11985, 2024

  38. [38]

    MCTBench: Multi- modal cognition towards text-rich visual scenes bench- mark,

    B. Shan, X. Fei, W. Shi, A. Wang, G. Tang, L. Liao, J. Tang, X. Bai, and C. Huang, “MCTBench: Multi- modal cognition towards text-rich visual scenes bench- mark,”arXiv preprint arXiv:2410.11538, 2024

  39. [39]

    SPTS v2: Single-point scene text spotting,

    Y . Liu, J. Zhang, D. Peng, M. Huang, X. Wang, J. Tang, C. Huang, D. Lin, C. Shen, X. Bai, and L. Jin, “SPTS v2: Single-point scene text spotting,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 038–15 055, 2023

  40. [40]

    Optimal boxes: Boosting end-to-end scene text recogni- tion by adjusting annotated bounding boxes via reinforce- ment learning,

    J. Tang, W. Qian, L. Song, X. Dong, L. Li, and X. Bai, “Optimal boxes: Boosting end-to-end scene text recogni- tion by adjusting annotated bounding boxes via reinforce- ment learning,” inEuropean Conference on Computer Vi- sion. Springer, 2022, pp. 233–248

  41. [41]

    A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding,

    J. Lu, H. Yu, Y . Wang, Y . Ye, J. Tang, Z. Yang, B. Wu, Q. Liu, H. Feng, H. Wanget al., “A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding,” inFindings of the As- sociation for Computational Linguistics: ACL 2025, 2025, pp. 7252–7273

  42. [42]

    WildDoc: How far are we from achieving comprehensive and robust document un- derstanding in the wild?

    A. Wang, J. Tang, L. Liao, H. Feng, Q. Liu, X. Fei, J. Lu, H. Wang, H. Liu, Y . Liuet al., “WildDoc: How far are we from achieving comprehensive and robust document un- derstanding in the wild?” inProceedings of the 2025 Con- ference on Empirical Methods in Natural Language Pro- cessing, 2025

  43. [43]

    Vision as LoRA,

    H. Wang, Y . Ye, B. Li, Y . Nie, J. Lu, J. Tang, Y . Wang, and C. Huang, “Vision as LoRA,”arXiv preprint arXiv:2503.20680, 2025

  44. [44]

    OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

    L. Fu, Z. Kuang, J. Song, M. Huang, B. Yang, Y . Li, L. Zhu, Q. Luo, X. Wang, H. Lu, G. Tang, B. Shan, C. Lin, Q. Liu, B. Wu, H. Feng, H. Liu, C. Huang, J. Tang, W. Chen, L. Jin, Y . Liu, and X. Bai, “OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning,”arXiv preprint arXiv:2501.00321, 2024

  45. [45]

    Character recognition competi- tion for street view shop signs,

    J. Tang, W. Du, B. Wang, W. Zhou, S. Mei, T. Xue, X. Xu, and H. Zhang, “Character recognition competi- tion for street view shop signs,”National Science Review, vol. 10, no. 6, p. nwad141, 2023

  46. [46]

    Benchmark- ing vision-language models on chinese ancient documents: From OCR to knowledge reasoning,

    H. Yu, Y . Wu, F. Shi, L. Liao, J. Lu, X. Ge, H. Wang, M. Zhuo, X. Wu, X. Fei, J. Tanget al., “Benchmark- ing vision-language models on chinese ancient documents: From OCR to knowledge reasoning,”arXiv preprint arXiv:2509.09731, 2025

  47. [47]

    Pargo: Bridging vision-language with partial and global views,

    A.-L. Wang, B. Shan, W. Shi, K.-Y . Lin, X. Fei, G. Tang, L. Liao, J. Tang, C. Huang, and W.-S. Zheng, “Pargo: Bridging vision-language with partial and global views,” vol. 39, no. 7, pp. 7491–7499, 2025

  48. [48]

    Textdragon: An end-to-end framework for arbitrary 9 shaped text spotting,

    W. Feng, W. He, F. Yin, X.-Y . Zhang, and C.-L. Liu, “Textdragon: An end-to-end framework for arbitrary 9 shaped text spotting,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9075–9084

  49. [49]

    Abc- net: Real-time scene text spotting with adaptive bezier- curve network,

    Y . Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “Abc- net: Real-time scene text spotting with adaptive bezier- curve network,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 9806–9815

  50. [50]

    MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement,

    W. Jia, J. Lu, H. Yu, S. Wang, G. Tang, A. Wang, W. Yin, D. Yang, Y . Nieet al., “MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement,” in Proceedings of the AAAI Conference on Artificial Intelli- gence, vol. 40, no. 37, 2026, p. 31283

  51. [51]

    TC-Pad ´e: Trajectory- consistent Pad´e approximation for diffusion acceleration,

    B. Cui, S. He, B. Huang, Z. Ye, Y . Sun, L. Huang, H. Xue, Y . Yang, J. Tanget al., “TC-Pad ´e: Trajectory- consistent Pad´e approximation for diffusion acceleration,” arXiv preprint arXiv:2603.02943, 2026

  52. [52]

    An end-to-end textspotter with explicit alignment and at- tention,

    T. He, Z. Tian, W. Huang, C. Shen, Y . Qiao, and C. Sun, “An end-to-end textspotter with explicit alignment and at- tention,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 5020–5029

  53. [53]

    Advancing sequential numerical prediction in autoregressive models,

    X. Fei, J. Lu, Q. Sun, H. Feng, Y . Wang, W. Shi, A. Wang, J. Tang, and C. Huang, “Advancing sequential numerical prediction in autoregressive models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

  54. [54]

    Prolonged reason- ing is not all you need: Certainty-based adaptive rout- ing for efficient LLM/MLLM reasoning,

    J. Lu, H. Yu, S. Xu, S. Ran, G. Tang, S. Wang, B. Shan, T. Fu, H. Feng, J. Tanget al., “Prolonged reason- ing is not all you need: Certainty-based adaptive rout- ing for efficient LLM/MLLM reasoning,”arXiv preprint arXiv:2505.15154, 2025

  55. [55]

    Attentive eraser: Unleashing diffusion model’s object removal po- tential via self-attention redirection guidance,

    W. Sun, X.-M. Dong, B. Cui, and J. Tang, “Attentive eraser: Unleashing diffusion model’s object removal po- tential via self-attention redirection guidance,” vol. 39, no. 19, pp. 20 734–20 742, 2025

  56. [56]

    Real-time scene text detection with differentiable binarization and adaptive scale fusion,

    M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai, “Real-time scene text detection with differentiable binarization and adaptive scale fusion,” vol. 45, no. 1. IEEE, 2023, pp. 919–931

  57. [57]

    TabPedia: Towards comprehensive visual table understanding with concept synergy,

    W. Zhao, H. Feng, Q. Liu, J. Tang, S. Wei, B. Wu, L. Liao, Y . Ye, H. Liu, W. Zhou, H. Li, and C. Huang, “TabPedia: Towards comprehensive visual table understanding with concept synergy,” inAdvances in Neural Information Pro- cessing Systems, vol. 37, 2024

  58. [58]

    Glass: Global to local attention for scene-text spotting,

    R. Ronen, S. Tsiper, O. Anschel, I. Lavi, A. Markovitz, and R. Manmatha, “Glass: Global to local attention for scene-text spotting,” arXiv preprint arXiv:2208.03364, 2022

  59. [59]

    Dolphin-v2: Universal document parsing via scalable anchor prompting,

    H. Feng, W. Shi, K. Zhang, X. Fei, L. Liao, D. Yang, Y. Du, X. Wu, J. Tang, Y. Liu, and X. Bai, “Dolphin-v2: Universal document parsing via scalable anchor prompting,” 2026

  60. [60]

    Resolving evidence sparsity: Agentic context engineering for long-document understanding,

    K. Liu, Z. Chen, M. Li, J. Tang, D. Yang, and L. Zhang, “Resolving evidence sparsity: Agentic context engineering for long-document understanding,” arXiv preprint arXiv:2511.22850, 2025

  61. [61]

    Mindev: Multi-modal integrated diffusion framework for video reconstruction from EEG signals,

    S. Huang, Y. Wang, H. Luo, H. Jing, C. Qin, and J. Tang, “Mindev: Multi-modal integrated diffusion framework for video reconstruction from EEG signals,” pp. 3350–3359, 2025

  62. [62]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021, pp. 9992–10002

  63. [63]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778

  64. [64]

    Focal loss for dense object detection,

    T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2999–3007

  65. [65]

    Generalized intersection over union: A metric and a loss for bounding box regression,

    H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 658–666

  66. [66]

    Istr: End-to-end instance segmentation with transformers,

    J. Hu, L. Cao, Y. Lu, S. Zhang, Y. Wang, K. Li, F. Huang, L. Shao, and R. Ji, “Istr: End-to-end instance segmentation with transformers,” arXiv preprint arXiv:2105.00637, 2021

  67. [67]

    Synthetic Data for Text Localisation in Natural Images

    A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localization in natural images,” arXiv preprint arXiv:1604.06646, 2016

  68. [68]

    ICDAR 2015 competition on robust reading,

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “ICDAR 2015 competition on robust reading,” in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1156–1160

  69. [69]

    Total-text: A comprehensive dataset for scene text detection and recognition,

    C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 935–942

  70. [70]

    Mango: A mask attention guided one-stage scene text spotter,

    L. Qiao, Y. Chen, Z. Cheng, Y. Xu, Y. Niu, S. Pu, and F. Wu, “Mango: A mask attention guided one-stage scene text spotter,” arXiv preprint arXiv:2012.04350, 2021

  71. [71]

    All you need is boundary: Toward arbitrary-shaped text spotting,

    H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu, “All you need is boundary: Toward arbitrary-shaped text spotting,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020, pp. 12160–12167

  72. [72]

    Pgnet: Real-time arbitrarily-shaped text spotting with point gathering network,

    P. Wang, C. Zhang, F. Qi, S. Liu, X. Zhang, P. Lyu, J. Han, J. Liu, E. Ding, and G. Shi, “Pgnet: Real-time arbitrarily-shaped text spotting with point gathering network,” arXiv preprint arXiv:2104.05458, 2021

  73. [73]

    Spts: Single-point text spotting,

    D. Du, X. Chen, J. Peng, J. Liu, D. Peng, and L. Jin, “Spts: Single-point text spotting,” in Proceedings of the 30th ACM International Conference on Multimedia. ACM, 2022, pp. 4272–4281

  74. [74]

    Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text,

    W. Wang, E. Xie, X. Li, X. Liu, D. Liang, Z. Yang, T. Lu, and C. Shen, “Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5349–5367, 2022.