Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

Antonio Colombo; Giovanni Bianchi

arxiv: 2605.18173 · v1 · pith:EEOR7C6Mnew · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 11:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords scene text spotting · text detection · text recognition · attention mechanism · transformer encoder · mask embedding · end-to-end framework · arbitrary shape text

The pith

Soft attention weights from transformers refine text masks to enable accurate spotting without any rectification step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a Soft Attention Mask Embedding module that processes high-level features through a Transformer encoder to produce soft attention weights. These weights are then combined step by step with initial mask predictions to create cleaner text-boundary masks that block out background interference. The resulting SAME-Net framework performs end-to-end detection and recognition on curved or rotated text while skipping both character-level labels and any separate rectification network. Gradients from the recognition loss flow back through the module to improve the detection branch during joint training. On Total-Text the abstract reports an 84.02% end-to-end H-mean, 1.02 points above the previous state-of-the-art GLASS, and on ICDAR 2015 a competitive 83.4% strong-lexicon result.

Core claim

By computing soft attention weights from Transformer-encoded high-level features and hierarchically embedding them with predicted masks, the SAME module produces refined text-boundary-aware masks that suppress background noise, allowing a single network to perform robust end-to-end scene text spotting without character-level annotations or auxiliary rectification modules.

What carries the argument

The Soft Attention Mask Embedding (SAME) module, which uses Transformer encoders to generate soft attention weights and embeds them hierarchically with mask predictions to refine text boundaries and reduce noise.
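
As a rough illustration of the mechanism described above, the sketch below computes soft attention weights with a toy single-head self-attention and multiplies them into a coarse mask. Everything in it is an assumption made for illustration: the identity query/key/value projections, the `text_direction` scoring vector, the sigmoid squashing, and the elementwise product are not taken from the paper, whose exact formulation is not given in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity projections (assumption).

    Every token attends to every other token, which is the global
    receptive field the paper attributes to the Transformer encoder.
    """
    d = len(tokens[0])
    encoded = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        encoded.append([sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(d)])
    return encoded


def soft_attention_weights(tokens, text_direction):
    """Score each encoded token against a hypothetical 'text' direction
    and squash the score into (0, 1)."""
    weights = []
    for t in self_attention(tokens):
        score = sum(a * b for a, b in zip(t, text_direction))
        weights.append(1.0 / (1.0 + math.exp(-score)))
    return weights


def embed(mask, weights):
    """Elementwise product of a predicted mask with the soft weights."""
    return [m * w for m, w in zip(mask, weights)]


# Two text-like tokens followed by two background-like tokens (made-up features).
tokens = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
weights = soft_attention_weights(tokens, text_direction=[1.0, -1.0])

# A coarse mask that leaks onto the first background position.
coarse_mask = [1.0, 1.0, 1.0, 0.0]
refined = embed(coarse_mask, weights)
```

In this toy setting the leaked background position keeps a much smaller value than the text positions after embedding, which is the kind of background suppression the module is claimed to provide.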

If this is right

Joint training of detection and recognition becomes possible because the module is fully differentiable and passes recognition gradients to the detection branch.
No character-level annotations or separate rectification modules are required while still handling arbitrarily shaped and multi-oriented text.
Accuracy gains appear on curved-text benchmarks without using extra training data beyond standard sets.
The same pipeline delivers competitive results on multi-oriented text datasets while removing the rectification component.
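
The differentiability point in the first item can be made concrete with a small numerical check. The sketch below is an illustration under stated assumptions, not the paper's training code: `recognition_loss` is a stand-in squared error, `soft_gate` is a sigmoid, `hard_gate` is a binary threshold, and the feature value is arbitrary. It estimates the gradient of the loss with respect to a mask logit by central finite differences.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def recognition_loss(gated_feature, target=1.0):
    """Stand-in for a recognition loss (hypothetical squared error)."""
    return (gated_feature - target) ** 2


def soft_gate(logit):
    """Soft mask value in (0, 1); smooth in the logit."""
    return sigmoid(logit)


def hard_gate(logit):
    """Binary mask value; piecewise constant in the logit."""
    return 1.0 if logit > 0 else 0.0


def finite_diff(gate, logit, feature=2.0, eps=1e-4):
    """Central finite-difference estimate of d(loss)/d(logit)."""
    up = recognition_loss(gate(logit + eps) * feature)
    down = recognition_loss(gate(logit - eps) * feature)
    return (up - down) / (2 * eps)


soft_grad = finite_diff(soft_gate, 0.5)
hard_grad = finite_diff(hard_gate, 0.5)
```

Away from the threshold the hard gate yields a zero gradient, so a recognition loss cannot adjust the logit that produced the mask. The soft gate yields a non-zero gradient, which is the property the joint-training claim depends on.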

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention-based refinement idea could be tested on other tasks involving irregular shapes, such as segmenting curved structures in medical scans or aerial imagery.
Removing explicit rectification may lower overall model complexity and inference time for real-time mobile text spotting applications.
If the mechanism proves robust across domains, it could reduce reliance on geometric transformations in broader irregular-object recognition pipelines.

Load-bearing premise

Soft attention weights derived from high-level Transformer features can be embedded with masks to create boundary-aware refinements that reliably separate arbitrary-shaped text from complex backgrounds.

What would settle it

Measure end-to-end accuracy of SAME-Net against an otherwise identical network that adds an explicit rectification branch on a dataset containing extreme perspective warps and heavy background clutter; if the rectification version wins by a clear margin, the claim weakens.

Original abstract

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.
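
For readers unfamiliar with the metric, H-mean is the harmonic mean of precision and recall. The helper below is a minimal sketch of that formula only. The function name is hypothetical, the inputs in the usage line are illustrative values rather than figures from the paper, and the abstract does not report the underlying precision and recall.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall (also called F1 or F-measure).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only, not values from the paper.
example = h_mean(0.9, 0.8)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a system cannot reach a high H-mean by trading away recall for precision or the reverse.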

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SAME-Net skips rectification with a transformer-based attention mask embedder and posts a small gain on Total-Text, but thin experimental details leave the resolution worry open.

The paper's main contribution is the SAME module inside SAME-Net. It takes high-level features from a transformer encoder, computes soft attention weights, and hierarchically embeds them with the initial mask predictions to produce cleaner, boundary-aware masks that suppress background without any separate rectification step. The whole thing stays fully differentiable so recognition loss can train the detection branch jointly. That is the concrete novelty they ship: a rectification-free pipeline that still claims to handle curved and multi-oriented text well.

They report 84.02% end-to-end H-mean on Total-Text, 1.02 points above GLASS, and 83.4% strong-lexicon on ICDAR 2015, all without extra data or character-level labels. Those numbers are the practical hook for anyone running real-world spotting systems. The joint optimization story is also clean engineering.

The soft spots are the usual ones at this stage. The abstract gives no ablations, no error bars, and no explicit description of data splits or training protocol, so it is hard to tell how much of the gain is from the SAME module versus careful tuning elsewhere. The stress-test concern about high-level transformer features lacking fine spatial resolution for precise arbitrary-shape boundaries is plausible and not obviously refuted by the abstract alone. If the full paper does not show that the hierarchical embedding actually recovers the lost detail, the central claim rests on thinner ground than the headline numbers suggest.

This is for people already working on end-to-end scene text spotting who want a simpler pipeline. A reader who cares about practical speed or fewer hand-crafted stages could extract value. It is solid enough to send to peer review so the experiments and the resolution question can be checked properly.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Soft Attention Mask Embedding (SAME) module for rectification-free end-to-end scene text spotting. SAME uses Transformer encoders on high-level features to compute soft attention weights that are hierarchically embedded with predicted masks, yielding refined text-boundary-aware masks to suppress background noise for arbitrary shapes. The resulting SAME-Net framework requires no character-level annotations or auxiliary rectification, supports joint optimization of detection and recognition via back-propagation, and reports 84.02% end-to-end H-mean on Total-Text (1.02% above GLASS) plus competitive 83.4% strong-lexicon results on ICDAR 2015.

Significance. If the empirical claims hold under rigorous verification, the work would be significant for simplifying scene text spotting pipelines by eliminating explicit rectification while improving performance on challenging arbitrary-shape benchmarks. The fully differentiable soft attention design enabling joint detection-recognition optimization is a clear strength, as is the reported gain without additional training data. These elements could influence future architectures if the resolution and reproducibility concerns are resolved.

major comments (2)

§3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.
§4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.

minor comments (1)

Notation for the hierarchical embedding step could be clarified with a diagram or pseudocode to improve readability of the mask refinement process.
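
To make the resolution concern and the request for pseudocode more tangible, here is one way a hierarchical embedding step could look. This is a sketch under assumptions, not the authors' method: the nearest-neighbour upsampling, the elementwise product, and the function names `upsample_nearest`, `embed_stage`, and `hierarchical_embed` are all hypothetical choices, and the paper may combine attention and masks quite differently.

```python
def upsample_nearest(coarse, factor):
    """Repeat each coarse cell `factor` times along both axes."""
    out = []
    for row in coarse:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def embed_stage(mask, attention):
    """Elementwise product of one mask stage with a same-size attention map."""
    return [[m * a for m, a in zip(mrow, arow)]
            for mrow, arow in zip(mask, attention)]


def hierarchical_embed(masks_by_stage, coarse_attention):
    """Embed one coarse attention map into square masks of growing resolution.

    `masks_by_stage` lists 2D masks, coarsest first; each side length is
    assumed to be an integer multiple of the attention map's side length.
    """
    refined = []
    for mask in masks_by_stage:
        factor = len(mask) // len(coarse_attention)
        attention = upsample_nearest(coarse_attention, factor)
        refined.append(embed_stage(mask, attention))
    return refined


# Made-up 2x2 attention map: left half text-like, right half background-like.
coarse_attention = [[0.9, 0.1], [0.9, 0.1]]
masks = [
    [[1.0] * 2 for _ in range(2)],  # stage 0: 2x2 mask
    [[1.0] * 4 for _ in range(4)],  # stage 1: 4x4 mask
]
refined = hierarchical_embed(masks, coarse_attention)
```

With plain nearest-neighbour upsampling, the attention boundary in the 4x4 stage can only fall on multiples of the upsampling factor. That blockiness is the kind of lost detail the first major comment asks the authors to address.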

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, providing explanations and indicating revisions made where appropriate.

Referee: §3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.

Authors: We appreciate the referee's emphasis on this foundational aspect of the SAME module. The Transformer encoder is applied to high-level features precisely to leverage its global receptive field for capturing long-range context, which informs the computation of soft attention weights. These weights are then hierarchically embedded with the predicted masks across multiple stages, enabling progressive boundary refinement and background suppression without requiring character-level annotations or explicit rectification. This design choice allows the global context to compensate for resolution loss from downsampling, as the attention mechanism adaptively focuses on text-relevant regions amid multi-scale and arbitrary-shape variations. To address the concern directly, we have revised §3 to include an expanded explanation of the hierarchical embedding process, a new figure illustrating the multi-stage refinement, and an ablation study isolating the Transformer's contribution. While we maintain that the current architecture suffices for the reported gains on Total-Text (as the rectification-free pipeline achieves state-of-the-art results), we have added a note in the discussion acknowledging that explicit multi-scale fusion could be explored as future work.

revision: partial

Referee: §4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.

Authors: We agree that additional experimental details are critical for reproducibility and to substantiate the role of the SAME module in driving the observed improvements. In the revised manuscript, we have substantially expanded §4 with the following: explicit descriptions of the standard data splits and preprocessing for Total-Text and ICDAR 2015; comprehensive ablation studies (including tables) on key components such as the Transformer encoder, soft attention weights, and hierarchical embedding, demonstrating their individual and combined contributions; results reported as mean with standard deviation over three independent runs to provide error bars; and further clarification on the back-propagation path enabling joint detection-recognition optimization. These revisions confirm that the 84.02% H-mean and the 1.02% gain over GLASS are attributable to the proposed module rather than post-hoc decisions, thereby strengthening confidence in the joint-optimization benefit.

revision: yes
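
The "mean with standard deviation over three independent runs" reporting mentioned in the rebuttal can be sketched as follows. The run scores and baseline values are hypothetical placeholders, not results from the paper. The `gain_exceeds_spread` helper and its threshold `k=2.0` are an informal rule of thumb introduced here for illustration, not a formal significance test.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(runs), stdev(runs)


def gain_exceeds_spread(runs, baseline_score, k=2.0):
    """Check whether the mean gain over a baseline exceeds k times the
    run-to-run standard deviation (informal rule of thumb, an assumption)."""
    m, s = summarize(runs)
    return (m - baseline_score) > k * s


# Hypothetical placeholder scores from three runs of one system.
runs = [80.0, 81.0, 82.0]
avg, spread = summarize(runs)

clear_gain = gain_exceeds_spread(runs, baseline_score=78.0)
unclear_gain = gain_exceeds_spread(runs, baseline_score=80.0)
```

A check of this kind is why the referee asks for run counts and error bars: a gain of about one point is only persuasive if the spread across runs is clearly smaller than that.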

Circularity Check

0 steps flagged

No circularity: architecture proposal validated empirically on external benchmarks

The paper proposes the SAME module as a differentiable architectural component that computes soft attention from Transformer-encoded features and embeds it hierarchically with mask predictions to refine boundaries. This is presented as an engineering design choice, not a derivation that reduces to its own fitted parameters or prior self-citations. The central performance claims (84.02% H-mean on Total-Text, +1.02% over GLASS) are reported as outcomes of end-to-end training and evaluation on standard public datasets, with no equations or uniqueness theorems shown that would make the reported gains tautological by construction. The module is fully differentiable by design, allowing joint optimization, but this does not create a self-definitional loop. No load-bearing self-citation chains or renamed empirical patterns are evident in the provided description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach relies on standard transformer attention and differentiability assumptions plus the new SAME module; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Soft attention weights from Transformer encoders can be hierarchically embedded with masks to refine boundaries.
Invoked in the description of the SAME module to suppress background noise without rectification.

invented entities (1)

Soft Attention Mask Embedding (SAME) module · no independent evidence
purpose: Generate refined text-boundary-aware masks from rough proposals using transformer attention
New component introduced to enable rectification-free spotting; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5773 in / 1391 out tokens · 28067 ms · 2026-05-20T11:24:56.858341+00:00 · methodology

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 4 internal anchors

[1]

A survey on table recognition technology,

L. Gao, Y . Li, L. Du, X. Zhang, Z. Zhu, N. Lu, L. Jin, Y . Huang, and Z. Tang, “A survey on table recognition technology,”Journal of Image and Graphics, vol. 27, no. 6, pp. 1898–1917, 2022

work page 1917
[2]

Deep learning methods for scene text detection and recognition,

C. Liu, X. Chen, C. Luo, L. Jin, Y . Xue, and Y . Liu, “Deep learning methods for scene text detection and recognition,” Journal of Image and Graphics, vol. 26, no. 6, pp. 1330– 1367, 2021

work page 2021
[3]

TextSquare: Scaling up text-centric visual instruction tuning,

J. Tang, C. Lin, Z. Zhao, S. Wei, B. Wu, Q. Liu, H. Feng, Y . Li, S. Wang, L. Liaoet al., “TextSquare: Scaling up text-centric visual instruction tuning,”arXiv preprint arXiv:2404.12803, 2024

work page arXiv 2024
[4]

An end-to-end trainable neural network for image-based sequence recognition and its ap- plication to scene text recognition,

B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its ap- plication to scene text recognition,” vol. 39, no. 11. IEEE, 2017, pp. 2298–2304

work page 2017
[5]

Charac- ter region awareness for text detection,

Y . Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Charac- ter region awareness for text detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR). IEEE, 2019, pp. 9357–9366

work page 2019
[6]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “GPT-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Denoising diffusion proba- bilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion proba- bilistic models,”Advances in Neural Information Process- ing Systems, vol. 33, pp. 6840–6851, 2020

work page 2020
[8]

High-resolution image synthesis with la- tent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with la- tent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 684–10 695

work page 2022
[9]

SETrans- former: A hybrid attention-based architecture for robust human activity recognition,

Y . Liu, X. Qin, Y . Gao, X. Li, and C. Feng, “SETrans- former: A hybrid attention-based architecture for robust human activity recognition,”INNO-PRESS: Journal of Emerging Applied AI, vol. 1, no. 1, 2025

work page 2025
[10]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017

work page 2017
[11]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kir- illov, and S. Zagoruyko, “End-to-end object detection with transformers,” inProceedings of the European Conference on Computer Vision (ECCV). Springer, 2020, pp. 213– 229

work page 2020
[12]

End-to-end scene text recognition,

K. Wang, B. Babenko, and S. Belongie, “End-to-end scene text recognition,” inProceedings of the International Con- ference on Computer Vision (ICCV). IEEE, 2011, pp. 1457–1464

work page 2011
[13]

Pho- toocr: Reading text in uncontrolled conditions,

A. Bissacco, M. Cummins, Y . Netzer, and H. Neven, “Pho- toocr: Reading text in uncontrolled conditions,” inPro- ceedings of the IEEE International Conference on Com- puter Vision (ICCV). IEEE, 2013, pp. 785–792

work page 2013
[14]

Textboxes: A fast text detector with a single deep neural network,

M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, “Textboxes: A fast text detector with a single deep neural network,” inProceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI Press, 2017, pp. 4161–4167

work page 2017
[15]

Towards end-to-end text spotting with convolutional recurrent neural networks,

H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting with convolutional recurrent neural networks,” inProceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5248–5256

work page 2017
[16]

Fots: Fast oriented text spotting with a unified network,

X. Liu, D. Liang, S. Yan, D. Chen, Y . Qiao, and J. Yan, “Fots: Fast oriented text spotting with a unified network,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 5676–5685

work page 2018
[17]

To- wards unconstrained end-to-end text spotting,

S. Qin, A. Bissaco, M. Raptis, Y . Fujii, and Y . Xiao, “To- wards unconstrained end-to-end text spotting,” inProceed- ings of the IEEE/CVF International Conference on Com- puter Vision (ICCV). IEEE, 2019, pp. 4703–4713

work page 2019
[18]

Dol- phin: Document image parsing via heterogeneous anchor prompting,

H. Feng, S. Wei, X. Fei, W. Shi, Y . Han, L. Liao, J. Lu, B. Wu, Q. Liu, C. Lin, J. Tang, and C. Huang, “Dol- phin: Document image parsing via heterogeneous anchor prompting,” inFindings of the Association for Computa- tional Linguistics: ACL 2025, 2025, pp. 21 919–21 936

work page 2025
[19]

UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding,

H. Feng, Z. Wang, J. Tang, J. Lu, W. Zhou, H. Li, and C. Huang, “UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding,”arXiv preprint arXiv:2308.11592, 2023

work page arXiv 2023
[20]

Swintextspotter: Scene text spotting via better synergy between text detection and text recog- nition,

M. Huang, Y . Liu, Z. Peng, C. Liu, D. Lin, S. Zhu, N. Yuan, K. Ding, and L. Jin, “Swintextspotter: Scene text spotting via better synergy between text detection and text recog- nition,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 4583–4593

work page 2022
[21]

Text spotting transformers,

X. Zhang, Y . Su, S. Tripathi, and Z. Tu, “Text spotting transformers,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 9509–9518

work page 2022
[22]

Deepsolo: Let transformer decoder with explicit points solo for text spotting,

M. Ye, J. Zhang, S. Zhao, J. Liu, T. Liu, B. Du, and D. Tao, “Deepsolo: Let transformer decoder with explicit points solo for text spotting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2023, pp. 19 348–19 357

work page 2023
[23]

You Only Look Once: Unified, Real-Time Object Detection

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,”arXiv preprint arXiv:1506.02640, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

Trends and prospects of techniques for haze removal from degraded images: A survey,

G. Sahu, A. Seal, D. Bhattacharjee, M. Nasipuri, P. Brida, and O. Krejcar, “Trends and prospects of techniques for haze removal from degraded images: A survey,”IEEE Transactions on Emerging Topics in Computational Intel- ligence, vol. 6, no. 4, pp. 762–782, 2022

work page 2022
[25]

Mask textspotter: An end-to-end trainable neural network for 8 spotting text with arbitrary shapes,

P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for 8 spotting text with arbitrary shapes,” inProceedings of the 15th European Conference on Computer Vision (ECCV). Springer, 2018, pp. 71–88

work page 2018
[26]

Mask textspotter v3: Segmentation proposal network for robust scene text spotting,

M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask textspotter v3: Segmentation proposal network for robust scene text spotting,” inProceedings of the 16th European Conference on Computer Vision (ECCV). Springer, 2020, pp. 706–722

work page 2020
[27]

Textsnake: A flexible representation for detecting text of arbitrary shapes,

S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “Textsnake: A flexible representation for detecting text of arbitrary shapes,” inProceedings of the European Confer- ence on Computer Vision (ECCV). Springer, 2018, pp. 19–35

work page 2018
[28]

Few could be better than all: Feature sampling and grouping for scene text detection,

J. Tang, W. Zhang, H. Liu, M. Yang, B. Jiang, G. Hu, and X. Bai, “Few could be better than all: Feature sampling and grouping for scene text detection,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2022, pp. 4563–4572

work page 2022
[29]

You can even annotate text with voice: Transcription- only-supervised text spotting,

J. Tang, S. Qiao, B. Cui, Y . Ma, S. Zhang, and D. Kanoulas, “You can even annotate text with voice: Transcription- only-supervised text spotting,” inProceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4154–4163

work page 2022
[30]

Arts: Eliminating inconsistency between text detection and recognition with auto-rectification text spotter,

H. Zhong, J. Tang, W. Wang, Z. Yang, C. Yao, and T. Lu, “Arts: Eliminating inconsistency between text detection and recognition with auto-rectification text spotter,”arXiv preprint arXiv:2110.10405, 2021

work page arXiv 2021
[31]

Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,

Y . Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8048–8064, 2022

work page 2022
[32]

Docpedia: Unleashing the power of large mul- timodal model in the frequency domain for versatile docu- ment understanding,

H. Feng, Q. Liu, H. Liu, J. Tang, W. Zhou, H. Li, and C. Huang, “Docpedia: Unleashing the power of large mul- timodal model in the frequency domain for versatile docu- ment understanding,”Science China Information Sciences, vol. 67, no. 12, pp. 1–14, 2024

work page 2024
[33]

Multi-modal in-context learning makes an ego-evolving scene text recognizer,

Z. Zhao, J. Tang, C. Lin, B. Wu, C. Huang, H. Liu, X. Tan, Z. Zhang, and Y . Xie, “Multi-modal in-context learning makes an ego-evolving scene text recognizer,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 567–15 576

work page 2024
[34]

Harmonizing visual text comprehension and generation,

Z. Zhao, J. Tang, B. Wu, C. Lin, S. Wei, H. Liu, X. Tan, Z. Zhang, C. Huang, and Y . Xie, “Harmonizing visual text comprehension and generation,”arXiv preprint arXiv:2407.16364, 2024

work page arXiv 2024
[35]

Estextspotter: Towards better scene text spotting with explicit synergy in transformer,

M. Huang, J. Zhang, D. Peng, H. Lu, C. Huang, Y . Liu, X. Bai, and L. Jin, “Estextspotter: Towards better scene text spotting with explicit synergy in transformer,” inPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2023, pp. 19 495–19 505

work page 2023
[36]

Feature pyramid networks for object detec- tion,

T.-Y . Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detec- tion,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 936–944

work page 2017
[37]

Mtvqa: Benchmarking multilingual text-centric visual question answering

J. Tang, Q. Liu, Y . Ye, J. Lu, S. Wei, C. Lin, W. Li, M. F. F. B. Mahmood, H. Feng, Z. Zhaoet al., “MTVQA: Benchmarking multilingual text-centric visual question answering,”arXiv preprint arXiv:2405.11985, 2024

work page arXiv 2024
[38]

MCTBench: Multi- modal cognition towards text-rich visual scenes bench- mark,

B. Shan, X. Fei, W. Shi, A. Wang, G. Tang, L. Liao, J. Tang, X. Bai, and C. Huang, “MCTBench: Multi- modal cognition towards text-rich visual scenes bench- mark,”arXiv preprint arXiv:2410.11538, 2024

work page arXiv 2024
[39]

SPTS v2: Single-point scene text spotting,

Y . Liu, J. Zhang, D. Peng, M. Huang, X. Wang, J. Tang, C. Huang, D. Lin, C. Shen, X. Bai, and L. Jin, “SPTS v2: Single-point scene text spotting,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 038–15 055, 2023

work page 2023
[40]

Optimal boxes: Boosting end-to-end scene text recogni- tion by adjusting annotated bounding boxes via reinforce- ment learning,

J. Tang, W. Qian, L. Song, X. Dong, L. Li, and X. Bai, “Optimal boxes: Boosting end-to-end scene text recogni- tion by adjusting annotated bounding boxes via reinforce- ment learning,” inEuropean Conference on Computer Vi- sion. Springer, 2022, pp. 233–248

work page 2022
[41]

A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding,

J. Lu, H. Yu, Y . Wang, Y . Ye, J. Tang, Z. Yang, B. Wu, Q. Liu, H. Feng, H. Wanget al., “A bounding box is worth one token: Interleaving layout and text in a large language model for document understanding,” inFindings of the As- sociation for Computational Linguistics: ACL 2025, 2025, pp. 7252–7273

work page 2025
[42]

WildDoc: How far are we from achieving comprehensive and robust document un- derstanding in the wild?

A. Wang, J. Tang, L. Liao, H. Feng, Q. Liu, X. Fei, J. Lu, H. Wang, H. Liu, Y . Liuet al., “WildDoc: How far are we from achieving comprehensive and robust document un- derstanding in the wild?” inProceedings of the 2025 Con- ference on Empirical Methods in Natural Language Pro- cessing, 2025

work page 2025
[43]

Vision as LoRA,

H. Wang, Y . Ye, B. Li, Y . Nie, J. Lu, J. Tang, Y . Wang, and C. Huang, “Vision as LoRA,”arXiv preprint arXiv:2503.20680, 2025

work page arXiv 2025
[44]

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

L. Fu, Z. Kuang, J. Song, M. Huang, B. Yang, Y . Li, L. Zhu, Q. Luo, X. Wang, H. Lu, G. Tang, B. Shan, C. Lin, Q. Liu, B. Wu, H. Feng, H. Liu, C. Huang, J. Tang, W. Chen, L. Jin, Y . Liu, and X. Bai, “OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning,”arXiv preprint arXiv:2501.00321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Character recognition competi- tion for street view shop signs,

J. Tang, W. Du, B. Wang, W. Zhou, S. Mei, T. Xue, X. Xu, and H. Zhang, “Character recognition competi- tion for street view shop signs,”National Science Review, vol. 10, no. 6, p. nwad141, 2023

work page 2023
[46]

Benchmark- ing vision-language models on chinese ancient documents: From OCR to knowledge reasoning,

H. Yu, Y . Wu, F. Shi, L. Liao, J. Lu, X. Ge, H. Wang, M. Zhuo, X. Wu, X. Fei, J. Tanget al., “Benchmark- ing vision-language models on chinese ancient documents: From OCR to knowledge reasoning,”arXiv preprint arXiv:2509.09731, 2025

work page arXiv 2025
[47]

Pargo: Bridging vision-language with partial and global views,

A.-L. Wang, B. Shan, W. Shi, K.-Y . Lin, X. Fei, G. Tang, L. Liao, J. Tang, C. Huang, and W.-S. Zheng, “Pargo: Bridging vision-language with partial and global views,” vol. 39, no. 7, pp. 7491–7499, 2025

work page 2025
[48]

Textdragon: An end-to-end framework for arbitrary 9 shaped text spotting,

W. Feng, W. He, F. Yin, X.-Y . Zhang, and C.-L. Liu, “Textdragon: An end-to-end framework for arbitrary 9 shaped text spotting,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9075–9084

work page 2019
[49]

Abc- net: Real-time scene text spotting with adaptive bezier- curve network,

Y . Liu, H. Chen, C. Shen, T. He, L. Jin, and L. Wang, “Abc- net: Real-time scene text spotting with adaptive bezier- curve network,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 9806–9815

work page 2020
[50] W. Jia, J. Lu, H. Yu, S. Wang, G. Tang, A. Wang, W. Yin, D. Yang, Y. Nie et al., “MEML-GRPO: Heterogeneous multi-expert mutual learning for RLVR advancement,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 37, 2026, p. 31283
[51] B. Cui, S. He, B. Huang, Z. Ye, Y. Sun, L. Huang, H. Xue, Y. Yang, J. Tang et al., “TC-Padé: Trajectory-consistent Padé approximation for diffusion acceleration,” arXiv preprint arXiv:2603.02943, 2026
[52] T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun, “An end-to-end textspotter with explicit alignment and attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 5020–5029
[53] X. Fei, J. Lu, Q. Sun, H. Feng, Y. Wang, W. Shi, A. Wang, J. Tang, and C. Huang, “Advancing sequential numerical prediction in autoregressive models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025
[54] J. Lu, H. Yu, S. Xu, S. Ran, G. Tang, S. Wang, B. Shan, T. Fu, H. Feng, J. Tang et al., “Prolonged reasoning is not all you need: Certainty-based adaptive routing for efficient LLM/MLLM reasoning,” arXiv preprint arXiv:2505.15154, 2025
[55] W. Sun, X.-M. Dong, B. Cui, and J. Tang, “Attentive eraser: Unleashing diffusion model’s object removal potential via self-attention redirection guidance,” vol. 39, no. 19, pp. 20 734–20 742, 2025
[56] M. Liao, Z. Zou, Z. Wan, C. Yao, and X. Bai, “Real-time scene text detection with differentiable binarization and adaptive scale fusion,” vol. 45, no. 1. IEEE, 2023, pp. 919–931
[57] W. Zhao, H. Feng, Q. Liu, J. Tang, S. Wei, B. Wu, L. Liao, Y. Ye, H. Liu, W. Zhou, H. Li, and C. Huang, “TabPedia: Towards comprehensive visual table understanding with concept synergy,” in Advances in Neural Information Processing Systems, vol. 37, 2024
[58] R. Ronen, S. Tsiper, O. Anschel, I. Lavi, A. Markovitz, and R. Manmatha, “Glass: Global to local attention for scene-text spotting,” arXiv preprint arXiv:2208.03364, 2022
[59] H. Feng, W. Shi, K. Zhang, X. Fei, L. Liao, D. Yang, Y. Du, X. Wu, J. Tang, Y. Liu, and X. Bai, “Dolphin-v2: Universal document parsing via scalable anchor prompting,” 2026
[60] K. Liu, Z. Chen, M. Li, J. Tang, D. Yang, and L. Zhang, “Resolving evidence sparsity: Agentic context engineering for long-document understanding,” arXiv preprint arXiv:2511.22850, 2025
[61] S. Huang, Y. Wang, H. Luo, H. Jing, C. Qin, and J. Tang, “Mindev: Multi-modal integrated diffusion framework for video reconstruction from EEG signals,” pp. 3350–3359, 2025
[62] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021, pp. 9992–10 002
[63] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778
[64] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2999–3007
[65] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 658–666
[66] J. Hu, L. Cao, Y. Lu, S. Zhang, Y. Wang, K. Li, F. Huang, L. Shao, and R. Ji, “Istr: End-to-end instance segmentation with transformers,” arXiv preprint arXiv:2105.00637, 2021
[67] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localization in natural images,” arXiv preprint arXiv:1604.06646, 2016
[68] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “Icdar 2015 competition on robust reading,” in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1156–1160
[69] C. K. Ch’ng and C. S. Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, pp. 935–942
[70] L. Qiao, Y. Chen, Z. Cheng, Y. Xu, Y. Niu, S. Pu, and F. Wu, “Mango: A mask attention guided one-stage scene text spotter,” arXiv preprint arXiv:2012.04350, 2021
[71] H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu, “All you need is boundary: Toward arbitrary-shaped text spotting,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020, pp. 12 160–12 167
[72] P. Wang, C. Zhang, F. Qi, S. Liu, X. Zhang, P. Lyu, J. Han, J. Liu, E. Ding, and G. Shi, “Pgnet: Real-time arbitrarily-shaped text spotting with point gathering network,” arXiv preprint arXiv:2104.05458, 2021
[73] D. Du, X. Chen, J. Peng, J. Liu, D. Peng, and L. Jin, “Spts: Single-point text spotting,” in Proceedings of the 30th ACM International Conference on Multimedia. ACM, 2022, pp. 4272–4281
[74] W. Wang, E. Xie, X. Li, X. Liu, D. Liang, Z. Yang, T. Lu, and C. Shen, “Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5349–5367, 2022.
