TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Duc Tri Tran; Phi Le Nguyen; Trung Thanh Nguyen; Vijay John; Yasutomo Kawanishi

arxiv: 2606.07161 · v1 · pith:3R4DEYLTnew · submitted 2026-06-05 · 💻 cs.CV

TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Duc Tri Tran , Trung Thanh Nguyen , Vijay John , Phi Le Nguyen , Yasutomo Kawanishi This is my paper

Pith reviewed 2026-06-27 22:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords video text spottingtrajectory aggregationtemporal clusteringvision-language modelurban surveillancetext recognitionVTSmotion blur

0 comments

The pith

Aggregating text recognition over full trajectories improves spotting accuracy in surveillance videos despite blur and occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that shifting from independent per-frame recognition to trajectory-level aggregation yields more consistent and accurate text spotting in dynamic video streams. TraRA achieves this through two modules that refine trajectories by grouping coherent instances temporally and visually, then fuse visual features with linguistic context via a low-rank adapted vision-language model. A sympathetic reader would care because video text spotting underpins automated monitoring in transportation and security, yet current methods produce erratic outputs when motion blur, occlusion, or scale changes appear. The authors report that this aggregation approach lifts both tracking and recognition metrics across four standard benchmarks.

Core claim

TraRA is a plug-and-play module that performs text recognition at the trajectory level by first using Temporal Clustering to group temporally and visually coherent text instances and then applying Vision-Language Aggregation, powered by a Low-Rank Adaptation-enhanced vision-language model, to combine visual cues and linguistic context across all frames in the trajectory, thereby delivering robust recognition under the motion blur, occlusion, and scale variation typical of urban surveillance video.

What carries the argument

Trajectory-level aggregation realized by Temporal Clustering for grouping coherent instances and Low-Rank Adaptation-enhanced Vision-Language Aggregation for cross-frame multimodal fusion.

If this is right

Tracking and recognition metrics improve over prior state-of-the-art methods on RoadText, BOVText, ArTVideo, and ICDAR15.
The approach functions as a drop-in addition to existing video text spotting pipelines.
Performance gains arise specifically from exploiting both temporal consistency and multimodal vision-language information.
The method mitigates the inconsistency caused by frame-independent recognition under surveillance conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-aggregation pattern could be tested on other sequential recognition problems such as license-plate reading or scene-text translation in video.
Replacing the low-rank adaptation with full fine-tuning or newer vision-language backbones might produce further gains, though at higher compute cost.
In live surveillance deployments the added latency of clustering and aggregation would need direct measurement against the observed accuracy benefit.

Load-bearing premise

The temporal clustering step reliably isolates coherent text instances and the vision-language fusion step measurably improves recognition when cues are combined across frames.

What would settle it

An ablation study on the four benchmarks showing that disabling either the Temporal Clustering or the Vision-Language Aggregation module produces no statistically significant gain over existing frame-level video text spotting baselines.

Figures

Figures reproduced from arXiv: 2606.07161 by Duc Tri Tran, Phi Le Nguyen, Trung Thanh Nguyen, Vijay John, Yasutomo Kawanishi.

**Figure 2.** Figure 2: Overview of the proposed TraRA and its Vision-Language Model fine-tuning strategy for video text spotting. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison between the proposed TraRA [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

read the original abstract

Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TraRA adds temporal clustering and LoRA-based vision-language fusion for trajectory-level video text spotting, but without ablations the gains cannot be attributed to those modules rather than the base model.

read the letter

The one thing to take away is that TraRA reframes video text spotting as a trajectory-level task instead of independent per-frame recognition. It uses a temporal clustering step to clean up noisy tracks and a LoRA-tuned vision-language model to combine visual and linguistic cues across frames, then reports better numbers on RoadText, BOVText, ArTVideo, and ICDAR15.

What is actually new is the explicit combination of trajectory refinement with multimodal aggregation in this surveillance setting, plus the plug-and-play claim and the released code. The paper does a service by testing on four standard benchmarks and making the implementation public, which lowers the barrier for follow-up work.

The soft spot is exactly the one flagged in the stress-test note. The abstract presents the two modules as the reason for the improvements, yet there are no module-level ablations, no clustering-quality metrics, and no controlled comparisons that hold the base VLM fixed. Without those, it is impossible to know whether the trajectory premise is doing the work or whether the gains are simply from swapping in a stronger recognition backbone. That gap is material, not minor, because the central claim depends on the modules functioning as described under motion blur and occlusion.

This paper is for computer-vision groups that build or deploy video text systems for traffic or surveillance. A reader who needs a new baseline or open code to adapt would get practical value; someone looking for a rigorously isolated contribution would be disappointed.

It deserves peer review. The idea is coherent, the datasets are appropriate, and the code release is real evidence of reproducibility. Referees can request the missing ablations without the paper being fundamentally broken.

Referee Report

1 major / 1 minor

Summary. The paper proposes TraRA, a plug-and-play method for video text spotting (VTS) that shifts from frame-independent recognition to trajectory-level aggregation. It introduces a Temporal Clustering module to refine noisy trajectories by grouping temporally and visually coherent instances, and a Vision-Language Aggregation module that uses a LoRA-enhanced vision-language model to fuse visual cues with linguistic context across frames. The approach is evaluated on four benchmarks (RoadText, BOVText, ArTVideo, ICDAR15), with claims of consistent improvements in tracking and recognition over state-of-the-art VTS methods under surveillance conditions such as motion blur and occlusion. Source code is provided via GitHub.

Significance. If the trajectory-level premise holds, TraRA could meaningfully advance robust VTS for urban surveillance and intelligent transportation by exploiting temporal and multimodal consistency. The availability of source code is a clear strength that supports reproducibility and future extensions.

major comments (1)

[Experiments] The central claim requires that the Temporal Clustering module reliably produces coherent trajectories and the Vision-Language Aggregation module produces useful multimodal fusion that improves recognition under surveillance degradations. However, the manuscript reports only end-to-end benchmark gains on RoadText/BOVText/ArTVideo/ICDAR15 without module-level ablations, clustering-quality metrics, or controlled comparisons that isolate the aggregation effect from the LoRA-enhanced VLM itself (see abstract and experiments description).

minor comments (1)

[Abstract] The abstract would benefit from quantitative statements on the magnitude of reported improvements rather than the qualitative claim of 'consistent improvements.'

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger module-level validation. We address the major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Experiments] The central claim requires that the Temporal Clustering module reliably produces coherent trajectories and the Vision-Language Aggregation module produces useful multimodal fusion that improves recognition under surveillance degradations. However, the manuscript reports only end-to-end benchmark gains on RoadText/BOVText/ArTVideo/ICDAR15 without module-level ablations, clustering-quality metrics, or controlled comparisons that isolate the aggregation effect from the LoRA-enhanced VLM itself (see abstract and experiments description).

Authors: We agree that the current experiments section focuses on end-to-end benchmark results and does not include dedicated module ablations or isolation studies. To strengthen the validation of the central claim, we will add: (1) quantitative clustering-quality metrics (e.g., temporal coherence and visual similarity scores) for the Temporal Clustering module, (2) ablation tables removing or replacing the Vision-Language Aggregation module, and (3) controlled comparisons of the LoRA-enhanced VLM against its base version within the aggregation pipeline. These additions will appear in a new subsection of the experiments and will be supported by the already-released source code. revision: yes

Circularity Check

0 steps flagged

No circularity; method is architectural with no derivations or fitted predictions.

full rationale

The manuscript describes a plug-and-play architecture (Temporal Clustering + LoRA-enhanced Vision-Language Aggregation) and reports end-to-end benchmark gains. No equations, parameter-fitting steps, uniqueness theorems, or self-citations are presented that would reduce any claimed result to its own inputs by construction. The central claim rests on empirical improvements rather than any self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review contains no mathematical content, parameters, axioms, or new entities.

pith-pipeline@v0.9.1-grok · 5787 in / 1016 out tokens · 24526 ms · 2026-06-27T22:09:29.026146+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 2 linked inside Pith

[1]

Scene text recognition for text-based traffic signs,

Y . Taki and E. Zemmouri, “Scene text recognition for text-based traffic signs,” inAdvances in Intelligent Traffic and Transportation Systems, 2023, pp. 67–77

2023
[2]

Scene text detection and recognition: The deep learning era,

S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,”International Journal of Computer Vision, vol. 129, no. 1, pp. 161–184, 2021

2021
[3]

End- to-end video text spotting with transformer,

W. Wu, Y . Cai, C. Shen, D. Zhang, Y . Fu, H. Zhou, and P. Luo, “End- to-end video text spotting with transformer,”International Journal of Computer Vision, vol. 132, no. 9, pp. 4019–4035, 2024

2024
[4]

GoMatching: A simple baseline for video text spotting via long and short term matching,

H. He, M. Ye, J. Zhang, J. Liu, B. Du, and D. Tao, “GoMatching: A simple baseline for video text spotting via long and short term matching,” Advances in Neural Information Processing Systems, vol. 37, pp. 25 663– 25 686, 2024

2024
[5]

GoMatching++: Parameter-and data-efficient arbitrary-shaped video text spotting and benchmarking,

H. He, J. Zhang, M. Ye, J. Liu, B. Du, and D. Tao, “GoMatching++: Parameter-and data-efficient arbitrary-shaped video text spotting and benchmarking,”arXiv preprint arXiv:2505.22228, 2025

arXiv 2025
[6]

DeepSolo: Let transformer decoder with explicit points solo for text spotting,

M. Ye, J. Zhang, S. Zhao, J. Liu, T. Liu, B. Du, and D. Tao, “DeepSolo: Let transformer decoder with explicit points solo for text spotting,” in Proc. of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 348–19 357

2023
[7]

Scene text recognition with permuted au- toregressive sequence models,

D. Bautista and R. Atienza, “Scene text recognition with permuted au- toregressive sequence models,” inProc. of the 14th European Conference on Computer Vision, 2022

2022
[8]

SVTR: Scene text recognition with a single visual model,

Y . Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y . Du, and Y .-G. Jiang, “SVTR: Scene text recognition with a single visual model,” inProc. of the 31st International Joint Conference on Artificial Intelligence, 2022

2022
[9]

A bilingual, openworld video text dataset and end-to-end video text spotter with transformer,

W. Wu, Y . Cai, D. Zhang, S. Wang, Z. Li, J. Li, Y . Tang, and H. Zhou, “A bilingual, openworld video text dataset and end-to-end video text spotter with transformer,”Advances in Neural Information Processing Systems, vol. 34, 2021

2021
[10]

DSText V2: A comprehensive video text spotting dataset for dense and small text,

W. Wu, Y . Zhang, Y . He, L. Zhang, Z. Lou, H. Zhou, and X. Bai, “DSText V2: A comprehensive video text spotting dataset for dense and small text,”Pattern Recognition, vol. 149, p. 110177, 2024

2024
[11]

ICDAR 2015 competition on robust reading,

D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V . R. Chandrasekhar, S. Luet al., “ICDAR 2015 competition on robust reading,” inProc. of the 13th International Conference on Document Analysis and Recognition, 2015, pp. 1156–1160

2015
[12]

Textssr: Diffusion-based data synthesis for scene text recognition,

X. Ye, Y . Du, Y . Tao, and Z. Chen, “Textssr: Diffusion-based data synthesis for scene text recognition,” inProc. of the 2025 IEEE/CVF International Conference on Computer Vision, 2025, pp. 17 464–17 473

2025
[13]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. of the 10th International Conference on Learning Representations, 2022

2022
[14]

Q-Adapter: Visual query adapter for extracting textually-related features in video captioning,

J. Chen, T. T. Nguyen, T. Komamizu, and I. Ide, “Q-Adapter: Visual query adapter for extracting textually-related features in video captioning,” inProceedings of the 7th ACM International Conference on Multimedia in Asia, 2025

2025
[15]

An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,

B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 2298–2304, 2015

2015
[16]

Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,

S. Fang, H. Xie, Y . Wang, Z. Mao, and Y . Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” inProc. of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7094–7103

2021
[17]

SVIPTR: Fast and efficient scene text recognition with vision permutable extractor,

X. Cheng, W. Zhou, X. Li, X. Chen, J. Yang, T. Li, and Z. Li, “SVIPTR: Fast and efficient scene text recognition with vision permutable extractor,” inProc. of the 33rd ACM International Conference on Information and Knowledge Management, 2024

2024
[18]

DiffusionSTR: Diffusion model for scene text recognition,

M. Fujitake, “DiffusionSTR: Diffusion model for scene text recognition,” inProc. of the 2023 IEEE International Conference on Image Processing, 2023, pp. 1585–1589

2023
[19]

Mask textspotter v3: Segmentation proposal network for robust scene text spotting,

M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask textspotter v3: Segmentation proposal network for robust scene text spotting,” in Proc. of the 16th European conference on computer vision, 2020, pp. 706–722

2020
[20]

Swintextspotter: Scene text spotting via better synergy between text detection and text recognition,

M. Huang, Y . Liu, Z. Peng, C. Liu, D. Lin, S. Zhu, N. Yuan, K. Ding, and L. Jin, “Swintextspotter: Scene text spotting via better synergy between text detection and text recognition,” inProc. of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4593–4603

2022
[21]

FREE: A fast and robust end-to-end video text spotter,

Z. Cheng, J. Lu, B. Zou, L. Qiao, Y . Xu, S. Pu, Y . Niu, F. Wu, and S. Zhou, “FREE: A fast and robust end-to-end video text spotter,”IEEE Transactions on Image Processing, vol. 30, pp. 822–837, 2020

2020
[22]

CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model,

S. Zhao, X. Wang, L. Zhu, and Y . Yang, “CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model,”IEEE Transactions on Image Processing, vol. 33, pp. 6893–6904, 2023

2023
[23]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProc. of the 38th International Conference on Machine Learning, 2021

2021
[24]

Nougat: Neural optical understanding for academic documents,

L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, “Nougat: Neural optical understanding for academic documents,” inProc. of the 2024 International Conference on Learning Representations, 2024

2024
[25]

Ovis2.5 technical report,

S. Lu, Y . Li, Y . Xia, Y . Hu, S. Zhao, Y . Ma, Z. Wei, Y . Li, L. Duan, J. Zhao et al., “Ovis2.5 technical report,”arXiv preprint arXiv:2508.11737, 2025

Pith/arXiv arXiv 2025
[26]

VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,

H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,” inProc. of the 32nd ACM international conference on multimedia, 2024, pp. 11 198–11 201

2024
[27]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdul- mohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025
[28]

Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,

M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, A. Oliver, P. Padlewski, A. Gritsenko, M. Lucic, and N. Houlsby, “Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 2252–2274

2023
[29]

Qwen3 technical report,

Z. Bai, A. Yang, Y . Zheng, J. Ye, X. Ma, X. Wang, X. Wei, X. Fan, S. Zhanget al., “Qwen3 technical report,”arXiv preprint arXiv:2501.13694, 2025

arXiv 2025
[30]

RoadText-1K: Text detection & recognition dataset for driving videos,

S. Reddy, M. Mathew, L. Gomez, M. Rusinol, D. Karatzas, and C. Jawahar, “RoadText-1K: Text detection & recognition dataset for driving videos,” inProc. of the 2020 IEEE International Conference on Robotics and Automation, 2020, pp. 11 074–11 080

2020
[31]

Evaluating multiple object tracking performance: the clear mot metrics,

K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,”EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008

2008
[32]

Performance measures and a data set for multi-target, multi-camera tracking,

E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” inProc. of the 14th European Conference on Computer Vision, 2016, pp. 17–35

2016

[1] [1]

Scene text recognition for text-based traffic signs,

Y . Taki and E. Zemmouri, “Scene text recognition for text-based traffic signs,” inAdvances in Intelligent Traffic and Transportation Systems, 2023, pp. 67–77

2023

[2] [2]

Scene text detection and recognition: The deep learning era,

S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,”International Journal of Computer Vision, vol. 129, no. 1, pp. 161–184, 2021

2021

[3] [3]

End- to-end video text spotting with transformer,

W. Wu, Y . Cai, C. Shen, D. Zhang, Y . Fu, H. Zhou, and P. Luo, “End- to-end video text spotting with transformer,”International Journal of Computer Vision, vol. 132, no. 9, pp. 4019–4035, 2024

2024

[4] [4]

GoMatching: A simple baseline for video text spotting via long and short term matching,

H. He, M. Ye, J. Zhang, J. Liu, B. Du, and D. Tao, “GoMatching: A simple baseline for video text spotting via long and short term matching,” Advances in Neural Information Processing Systems, vol. 37, pp. 25 663– 25 686, 2024

2024

[5] [5]

GoMatching++: Parameter-and data-efficient arbitrary-shaped video text spotting and benchmarking,

H. He, J. Zhang, M. Ye, J. Liu, B. Du, and D. Tao, “GoMatching++: Parameter-and data-efficient arbitrary-shaped video text spotting and benchmarking,”arXiv preprint arXiv:2505.22228, 2025

arXiv 2025

[6] [6]

DeepSolo: Let transformer decoder with explicit points solo for text spotting,

M. Ye, J. Zhang, S. Zhao, J. Liu, T. Liu, B. Du, and D. Tao, “DeepSolo: Let transformer decoder with explicit points solo for text spotting,” in Proc. of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 348–19 357

2023

[7] [7]

Scene text recognition with permuted au- toregressive sequence models,

D. Bautista and R. Atienza, “Scene text recognition with permuted au- toregressive sequence models,” inProc. of the 14th European Conference on Computer Vision, 2022

2022

[8] [8]

SVTR: Scene text recognition with a single visual model,

Y . Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y . Du, and Y .-G. Jiang, “SVTR: Scene text recognition with a single visual model,” inProc. of the 31st International Joint Conference on Artificial Intelligence, 2022

2022

[9] [9]

A bilingual, openworld video text dataset and end-to-end video text spotter with transformer,

W. Wu, Y . Cai, D. Zhang, S. Wang, Z. Li, J. Li, Y . Tang, and H. Zhou, “A bilingual, openworld video text dataset and end-to-end video text spotter with transformer,”Advances in Neural Information Processing Systems, vol. 34, 2021

2021

[10] [10]

DSText V2: A comprehensive video text spotting dataset for dense and small text,

W. Wu, Y . Zhang, Y . He, L. Zhang, Z. Lou, H. Zhou, and X. Bai, “DSText V2: A comprehensive video text spotting dataset for dense and small text,”Pattern Recognition, vol. 149, p. 110177, 2024

2024

[11] [11]

ICDAR 2015 competition on robust reading,

D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V . R. Chandrasekhar, S. Luet al., “ICDAR 2015 competition on robust reading,” inProc. of the 13th International Conference on Document Analysis and Recognition, 2015, pp. 1156–1160

2015

[12] [12]

Textssr: Diffusion-based data synthesis for scene text recognition,

X. Ye, Y . Du, Y . Tao, and Z. Chen, “Textssr: Diffusion-based data synthesis for scene text recognition,” inProc. of the 2025 IEEE/CVF International Conference on Computer Vision, 2025, pp. 17 464–17 473

2025

[13] [13]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. of the 10th International Conference on Learning Representations, 2022

2022

[14] [14]

Q-Adapter: Visual query adapter for extracting textually-related features in video captioning,

J. Chen, T. T. Nguyen, T. Komamizu, and I. Ide, “Q-Adapter: Visual query adapter for extracting textually-related features in video captioning,” inProceedings of the 7th ACM International Conference on Multimedia in Asia, 2025

2025

[15] [15]

An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,

B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 2298–2304, 2015

2015

[16] [16]

Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,

S. Fang, H. Xie, Y . Wang, Z. Mao, and Y . Zhang, “Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition,” inProc. of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7094–7103

2021

[17] [17]

SVIPTR: Fast and efficient scene text recognition with vision permutable extractor,

X. Cheng, W. Zhou, X. Li, X. Chen, J. Yang, T. Li, and Z. Li, “SVIPTR: Fast and efficient scene text recognition with vision permutable extractor,” inProc. of the 33rd ACM International Conference on Information and Knowledge Management, 2024

2024

[18] [18]

DiffusionSTR: Diffusion model for scene text recognition,

M. Fujitake, “DiffusionSTR: Diffusion model for scene text recognition,” inProc. of the 2023 IEEE International Conference on Image Processing, 2023, pp. 1585–1589

2023

[19] [19]

Mask textspotter v3: Segmentation proposal network for robust scene text spotting,

M. Liao, G. Pang, J. Huang, T. Hassner, and X. Bai, “Mask textspotter v3: Segmentation proposal network for robust scene text spotting,” in Proc. of the 16th European conference on computer vision, 2020, pp. 706–722

2020

[20] [20]

Swintextspotter: Scene text spotting via better synergy between text detection and text recognition,

M. Huang, Y . Liu, Z. Peng, C. Liu, D. Lin, S. Zhu, N. Yuan, K. Ding, and L. Jin, “Swintextspotter: Scene text spotting via better synergy between text detection and text recognition,” inProc. of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4593–4603

2022

[21] [21]

FREE: A fast and robust end-to-end video text spotter,

Z. Cheng, J. Lu, B. Zou, L. Qiao, Y . Xu, S. Pu, Y . Niu, F. Wu, and S. Zhou, “FREE: A fast and robust end-to-end video text spotter,”IEEE Transactions on Image Processing, vol. 30, pp. 822–837, 2020

2020

[22] [22]

CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model,

S. Zhao, X. Wang, L. Zhu, and Y . Yang, “CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model,”IEEE Transactions on Image Processing, vol. 33, pp. 6893–6904, 2023

2023

[23] [23]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProc. of the 38th International Conference on Machine Learning, 2021

2021

[24] [24]

Nougat: Neural optical understanding for academic documents,

L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic, “Nougat: Neural optical understanding for academic documents,” inProc. of the 2024 International Conference on Learning Representations, 2024

2024

[25] [25]

Ovis2.5 technical report,

S. Lu, Y . Li, Y . Xia, Y . Hu, S. Zhao, Y . Ma, Z. Wei, Y . Li, L. Duan, J. Zhao et al., “Ovis2.5 technical report,”arXiv preprint arXiv:2508.11737, 2025

Pith/arXiv arXiv 2025

[26] [26]

VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,

H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “VLMEvalKit: An open-source toolkit for evaluating large multi-modality models,” inProc. of the 32nd ACM international conference on multimedia, 2024, pp. 11 198–11 201

2024

[27] [27]

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdul- mohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,”arXiv preprint arXiv:2502.14786, 2025

Pith/arXiv arXiv 2025

[28] [28]

Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,

M. Dehghani, B. Mustafa, J. Djolonga, J. Heek, M. Minderer, M. Caron, A. Steiner, J. Puigcerver, R. Geirhos, I. M. Alabdulmohsin, A. Oliver, P. Padlewski, A. Gritsenko, M. Lucic, and N. Houlsby, “Patch n’ pack: Navit, a vision transformer for any aspect ratio and resolution,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 2252–2274

2023

[29] [29]

Qwen3 technical report,

Z. Bai, A. Yang, Y . Zheng, J. Ye, X. Ma, X. Wang, X. Wei, X. Fan, S. Zhanget al., “Qwen3 technical report,”arXiv preprint arXiv:2501.13694, 2025

arXiv 2025

[30] [30]

RoadText-1K: Text detection & recognition dataset for driving videos,

S. Reddy, M. Mathew, L. Gomez, M. Rusinol, D. Karatzas, and C. Jawahar, “RoadText-1K: Text detection & recognition dataset for driving videos,” inProc. of the 2020 IEEE International Conference on Robotics and Automation, 2020, pp. 11 074–11 080

2020

[31] [31]

Evaluating multiple object tracking performance: the clear mot metrics,

K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,”EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, p. 246309, 2008

2008

[32] [32]

Performance measures and a data set for multi-target, multi-camera tracking,

E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” inProc. of the 14th European Conference on Computer Vision, 2016, pp. 17–35

2016