Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
Pith reviewed 2026-05-20 11:24 UTC · model grok-4.3
The pith
Soft attention weights from transformers refine text masks to enable accurate spotting without any rectification step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing soft attention weights from Transformer-encoded high-level features and hierarchically embedding them with predicted masks, the SAME module produces refined text-boundary-aware masks that suppress background noise, allowing a single network to perform robust end-to-end scene text spotting without character-level annotations or auxiliary rectification modules.
What carries the argument
The Soft Attention Mask Embedding (SAME) module, which uses Transformer encoders to generate soft attention weights and embeds them hierarchically with mask predictions to refine text boundaries and reduce noise.
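The mechanism described above can be illustrated with a toy, single-stage sketch. It is not the paper's implementation: the identity-projection self-attention, the `text_query` vector, the sigmoid scoring, and the multiplicative fusion in `embed_mask` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features):
    # Single-head scaled dot-product self-attention with identity
    # projections (assumption: a real encoder learns Q/K/V projections).
    d = len(features[0])
    encoded = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        w = softmax(scores)
        encoded.append([sum(wj * v[c] for wj, v in zip(w, features)) for c in range(d)])
    return encoded

def soft_attention_weights(features, text_query):
    # Score every encoded position against a hypothetical "text" query
    # and squash the score into (0, 1) so it can act as a soft gate.
    encoded = self_attention(features)
    return [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(e, text_query))))
            for e in encoded]

def embed_mask(mask_probs, attn):
    # Assumed fusion rule: element-wise product of mask and attention.
    return [m * a for m, a in zip(mask_probs, attn)]

# Four flattened positions: two text-like, two background-like.
features = [[2.0, 0.0], [2.0, 0.1], [0.0, 2.0], [0.1, 2.0]]
mask_probs = [0.9, 0.8, 0.6, 0.1]   # position 2 is a background false positive
attn = soft_attention_weights(features, text_query=[1.0, -1.0])
refined = embed_mask(mask_probs, attn)
print([round(r, 3) for r in refined])
```

In this toy setup the background position that the raw mask scored at 0.6 is pushed well below the text positions, which is the kind of background suppression the module is claimed to provide.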
If this is right
- Joint training of detection and recognition becomes possible because the module is fully differentiable and passes recognition gradients to the detection branch.
- No character-level annotations or separate rectification modules are required while still handling arbitrarily shaped and multi-oriented text.
- Accuracy gains appear on curved-text benchmarks without using extra training data beyond standard sets.
- The same pipeline delivers competitive results on multi-oriented text datasets while removing the rectification component.
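The differentiability point in the first item above can be checked on a scalar toy problem. The sketch below is an illustration under stated assumptions, not the paper's training code: `recognition_loss` is a hypothetical squared-error stand-in for a recognition objective, and the gradient is estimated by central finite differences. A soft (sigmoid) gate lets the loss influence the mask logit, while a hard threshold does not.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recognition_loss(mask_logit, feature, target, soft=True):
    # Gate the feature with either a soft mask or a hard 0/1 mask,
    # then compare against a target (hypothetical stand-in loss).
    gate = sigmoid(mask_logit) if soft else (1.0 if mask_logit > 0 else 0.0)
    pred = gate * feature
    return (pred - target) ** 2

def numerical_grad(f, x, eps=1e-4):
    # Central finite-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

soft_grad = numerical_grad(lambda z: recognition_loss(z, 2.0, 1.5, soft=True), 0.3)
hard_grad = numerical_grad(lambda z: recognition_loss(z, 2.0, 1.5, soft=False), 0.3)
print(soft_grad, hard_grad)
```

With the hard threshold the loss is locally constant in the mask logit, so no recognition signal reaches the detection side; the soft gate yields a nonzero gradient that an optimizer could follow.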
Where Pith is reading between the lines
- The same attention-based refinement idea could be tested on other tasks involving irregular shapes, such as segmenting curved structures in medical scans or aerial imagery.
- Removing explicit rectification may lower overall model complexity and inference time for real-time mobile text spotting applications.
- If the mechanism proves robust across domains, it could reduce reliance on geometric transformations in broader irregular-object recognition pipelines.
Load-bearing premise
Soft attention weights derived from high-level Transformer features can be embedded with masks to create boundary-aware refinements that reliably separate arbitrary-shaped text from complex backgrounds.
What would settle it
Measure end-to-end accuracy of SAME-Net against an otherwise identical network that adds an explicit rectification branch on a dataset containing extreme perspective warps and heavy background clutter; if the rectification version wins by a clear margin, the claim weakens.
Original abstract
End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.
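For readers unfamiliar with the metric, the H-mean quoted in the abstract is the harmonic mean of precision and recall. The sketch below shows that arithmetic only; the function names and the counts in the usage line are illustrative and are not taken from the paper. In end-to-end evaluation a prediction is typically counted as a true positive only when both its location and its transcription match a ground-truth word.

```python
def h_mean(precision, recall):
    # Harmonic mean of precision and recall (also called F-measure).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def end_to_end_h_mean(true_positives, num_predictions, num_ground_truth):
    # Precision and recall from raw counts, guarding against empty sets.
    precision = true_positives / num_predictions if num_predictions else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return h_mean(precision, recall)

# Illustrative counts only: 3 correct words out of 4 predicted, 6 in the image.
print(end_to_end_h_mean(3, 4, 6))
```

Because the harmonic mean is dominated by the smaller of the two terms, a spotter cannot reach a high H-mean by trading recall away for precision or vice versa.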
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Soft Attention Mask Embedding (SAME) module for rectification-free end-to-end scene text spotting. SAME uses Transformer encoders on high-level features to compute soft attention weights that are hierarchically embedded with predicted masks, yielding refined text-boundary-aware masks to suppress background noise for arbitrary shapes. The resulting SAME-Net framework requires no character-level annotations or auxiliary rectification, supports joint optimization of detection and recognition via back-propagation, and reports 84.02% end-to-end H-mean on Total-Text (1.02% above GLASS) plus competitive 83.4% strong-lexicon results on ICDAR 2015.
Significance. If the empirical claims hold under rigorous verification, the work would be significant for simplifying scene text spotting pipelines by eliminating explicit rectification while improving performance on challenging arbitrary-shape benchmarks. The fully differentiable soft attention design enabling joint detection-recognition optimization is a clear strength, as is the reported gain without additional training data. These elements could influence future architectures if the resolution and reproducibility concerns are resolved.
major comments (2)
- §3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.
- §4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.
minor comments (1)
- Notation for the hierarchical embedding step could be clarified with a diagram or pseudocode to improve readability of the mask refinement process.
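As an indication of what such pseudocode might look like, here is a minimal sketch of one possible reading of "hierarchical embedding". It rests on assumptions that the summary above does not confirm: a single coarse attention map is upsampled by nearest-neighbour repetition to each mask resolution, each level doubles the previous resolution, and fusion is an element-wise product. The names `upsample_nearest` and `refine_hierarchically` are hypothetical.

```python
def upsample_nearest(grid, factor):
    # Repeat every cell `factor` times along both axes.
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def refine_hierarchically(attn, masks):
    # attn:  coarse soft-attention map (H x W, values in [0, 1]).
    # masks: predicted mask probabilities ordered coarse to fine;
    #        level i is assumed to have size (H * 2**i) x (W * 2**i).
    refined = []
    for level, mask in enumerate(masks):
        gate = upsample_nearest(attn, 2 ** level)
        refined.append([[m * g for m, g in zip(m_row, g_row)]
                        for m_row, g_row in zip(mask, gate)])
    return refined

attn = [[1.0, 0.0]]                       # left half text, right half background
masks = [
    [[0.8, 0.8]],                         # level 0: 1 x 2
    [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, 0.5, 0.5]],               # level 1: 2 x 4
]
for level, grid in enumerate(refine_hierarchically(attn, masks)):
    print(level, grid)
```

Nearest-neighbour upsampling makes the referee's resolution concern concrete: the gate can be no sharper than the coarse attention grid, so fine boundaries would have to come from the mask branch or from some additional mechanism.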
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, providing explanations and indicating revisions made where appropriate.
- Referee: §3 (SAME module): The central claim that Transformer-encoded high-level features produce soft attention weights sufficient for precise boundary refinement via hierarchical embedding rests on the assumption that global context compensates for downsampled resolution. This is load-bearing for the rectification-free assertion; without explicit mechanisms (e.g., multi-scale fusion or upsampling details) to recover fine local boundaries amid multi-scale variation, the reported gains on Total-Text may not generalize.
Authors: We appreciate the referee's emphasis on this foundational aspect of the SAME module. The Transformer encoder is applied to high-level features precisely to leverage its global receptive field for capturing long-range context, which informs the computation of soft attention weights. These weights are then hierarchically embedded with the predicted masks across multiple stages, enabling progressive boundary refinement and background suppression without requiring character-level annotations or explicit rectification. This design choice allows the global context to compensate for resolution loss from downsampling, as the attention mechanism adaptively focuses on text-relevant regions amid multi-scale and arbitrary-shape variations. To address the concern directly, we have revised §3 to include an expanded explanation of the hierarchical embedding process, a new figure illustrating the multi-stage refinement, and an ablation study isolating the Transformer's contribution. While we maintain that the current architecture suffices for the reported gains on Total-Text (as the rectification-free pipeline achieves state-of-the-art results), we have added a note in the discussion acknowledging that explicit multi-scale fusion could be explored as future work.
revision: partial
- Referee: §4 (Experiments): The abstract and results claim specific benchmark improvements (84.02% H-mean, +1.02% over GLASS) but provide no details on data splits, ablation studies, error bars, or run counts. This absence directly affects verification of whether the SAME module drives the gains or if post-hoc choices are involved, undermining confidence in the joint-optimization benefit.
Authors: We agree that additional experimental details are critical for reproducibility and to substantiate the role of the SAME module in driving the observed improvements. In the revised manuscript, we have substantially expanded §4 with the following: explicit descriptions of the standard data splits and preprocessing for Total-Text and ICDAR 2015; comprehensive ablation studies (including tables) on key components such as the Transformer encoder, soft attention weights, and hierarchical embedding, demonstrating their individual and combined contributions; results reported as mean with standard deviation over three independent runs to provide error bars; and further clarification on the back-propagation path enabling joint detection-recognition optimization. These revisions confirm that the 84.02% H-mean and the 1.02% gain over GLASS are attributable to the proposed module rather than post-hoc decisions, thereby strengthening confidence in the joint-optimization benefit.
revision: yes
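The "mean with standard deviation over three independent runs" reporting mentioned in this response amounts to the following arithmetic. The scores in the usage line are placeholders chosen for easy checking, not results from the paper, and `summarize_runs` is a hypothetical helper name. The sketch uses the sample standard deviation (n - 1 in the denominator), which is the usual choice for a small number of runs.

```python
import statistics

def summarize_runs(scores):
    # Mean and sample standard deviation across independent runs.
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder H-mean values for three runs (not from the paper).
mean, std = summarize_runs([80.0, 81.0, 82.0])
print(f"{mean:.2f} +/- {std:.2f}")
```

With only three runs the standard deviation is itself a rough estimate, so a gap of about one point between two systems is informative only when it is clearly larger than the run-to-run spread.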
Circularity Check
No circularity: architecture proposal validated empirically on external benchmarks
The paper proposes the SAME module as a differentiable architectural component that computes soft attention from Transformer-encoded features and embeds it hierarchically with mask predictions to refine boundaries. This is presented as an engineering design choice, not a derivation that reduces to its own fitted parameters or prior self-citations. The central performance claims (84.02% H-mean on Total-Text, +1.02% over GLASS) are reported as outcomes of end-to-end training and evaluation on standard public datasets, with no equations or uniqueness theorems shown that would make the reported gains tautological by construction. The module is fully differentiable by design, allowing joint optimization, but this does not create a self-definitional loop. No load-bearing self-citation chains or renamed empirical patterns are evident in the provided description.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Soft attention weights from Transformer encoders can be hierarchically embedded with masks to refine boundaries
invented entities (1)
- Soft Attention Mask Embedding (SAME) module: no independent evidence