pith. machine review for the scientific record.

arxiv: 2604.27712 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.CL

Recognition: unknown

Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention

Anh-Duc Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Nghia Hieu Nguyen, Nhi Ngoc-Yen Nguyen

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:06 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords Vietnamese scene-text captioning · multimodal fusion · graph neural networks · phonological attention · ViTextCaps dataset · diacritic collision · OCR errors · tonal language

The pith

Cross-modal graph edges degrade scene-text fusion, so Vietnamese captioning needs phonological attention instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that standard multimodal fusion fails for Vietnamese scene-text image captioning because it treats text as language-agnostic, ignoring both tonal features, in which diacritics alter word meaning, and pervasive OCR errors. The authors introduce the HSTFG graph fusion framework with learned spatial attention bias and use topology analysis to demonstrate that cross-modal edges linking visual and text nodes are harmful. They therefore specialize the model into PhonoSTFG, which adds phonological attention to embed Vietnamese linguistic structure directly into the fusion process. They support evaluation with the new ViTextCaps dataset of 15,729 images and 74,970 captions, where linguistic analysis finds 52.8 percent of the vocabulary at risk of diacritic collision. A sympathetic reader would care because faithful integration of visible text into scene descriptions is essential for accessibility and search in tonal languages, where small orthographic changes flip meanings.
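
The 52.8 percent statistic is a vocabulary-level computation. As a minimal sketch, assuming words collide when they share the same diacritic-stripped base form (the paper's exact procedure is not reproduced here):

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Remove combining marks (tone and vowel-quality diacritics) via NFD."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 'đ' is a distinct letter rather than a combining mark, so map it explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")

def collision_rate(vocabulary: list[str]) -> float:
    """Fraction of distinct words whose stripped form is shared with another word."""
    groups = defaultdict(set)
    for word in set(vocabulary):
        groups[strip_diacritics(word)].add(word)
    at_risk = sum(len(g) for g in groups.values() if len(g) > 1)
    return at_risk / len(set(vocabulary))

# 'ma', 'má', 'mà', 'mã', 'mả', 'mạ' are six distinct words sharing one base form.
print(collision_rate(["ma", "má", "mà", "mã", "mả", "mạ", "xanh"]))  # ≈ 0.857
```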

Core claim

Vietnamese scene-text image captioning requires linguistically informed multimodal fusion. Topology analysis of the Heterogeneous Scene-Text Fusion Graph (HSTFG), a general framework with learned spatial attention bias, shows that cross-modal graph edges are harmful for fusion. Specializing this design yields the Phonological Scene-Text Fusion Graph (PhonoSTFG), which incorporates phonological attention to handle Vietnamese tonal and diacritic reasoning. The claim rests on the introduced ViTextCaps dataset, where 52.8 percent of the vocabulary risks meaning change from missing diacritics.
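
To make the topology claim concrete, here is a minimal sketch of one edge-typed graph attention layer in the style the Figure 7 caption describes (d = 768; three configurable edge types V→T, T→V, T→T with independent WQ, WK, WV projections; L = 3 layers). The form of the learned spatial attention bias, a small MLP over relative box geometry, is our assumption; the paper states only that the bias is learned.

```python
import torch
import torch.nn as nn

D = 768  # node embedding dimension given in the pipeline description

class EdgeTypedGraphAttention(nn.Module):
    def __init__(self, edge_types=("v2t", "t2v", "t2t")):
        super().__init__()
        # Independent W_Q, W_K, W_V projections per edge type, as in Figure 7.
        self.proj = nn.ModuleDict({
            et: nn.ModuleDict({name: nn.Linear(D, D) for name in ("q", "k", "v")})
            for et in edge_types
        })
        # Learned spatial attention bias: assumed here to be a small MLP over
        # 4-dimensional relative box geometry between each node pair.
        self.spatial_bias = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, src, dst, rel_geom, edge_type):
        # src: (N_src, D), dst: (N_dst, D), rel_geom: (N_dst, N_src, 4)
        p = self.proj[edge_type]
        q, k, v = p["q"](dst), p["k"](src), p["v"](src)
        logits = q @ k.T / D ** 0.5 + self.spatial_bias(rel_geom).squeeze(-1)
        return torch.softmax(logits, dim=-1) @ v  # updated destination nodes
```

The topology analysis then reduces to toggling edge types per layer: dropping the v2t and t2v calls leaves only intra-modal T→T message passing, the variant the paper reports as stronger.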

What carries the argument

PhonoSTFG, the Phonological Scene-Text Fusion Graph, which extends the general HSTFG framework by adding phonological attention to integrate linguistic knowledge and resolve tonal and diacritic ambiguities during Vietnamese scene-text captioning.
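
The Figure 8 caption supplies the gating equation directly; a minimal sketch of the dual-stream token embedding follows, with the combination of recognition and detection features into a single visual stream assumed, since the caption does not spell it out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 768

class GatedPhonoTokenEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(2 * D, D)  # Wg in the caption's g = σ(Wg[v_vis; v_pho])

    def forward(self, rec_feat, det_feat, pho_feat):
        # Visual OCR stream: the caption says L2-normalised recognition and
        # detection features; summing them before normalising is our assumption.
        v_vis = F.normalize(rec_feat + det_feat, dim=-1)
        v_pho = pho_feat  # frozen PhoBERT linguistic stream (~135M parameters)
        g = torch.sigmoid(self.gate(torch.cat([v_vis, v_pho], dim=-1)))
        # Per-dimension gate; weighting v_pho by (1 - g) is also an assumption,
        # as the caption does not state the complementary term.
        return g * v_vis + (1 - g) * v_pho
```

Per the caption, the fused T nodes then participate only in T→T graph edges, while V node embeddings bypass the graph.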

If this is right

  • Graph fusion for scene-text images should omit direct cross-modal edges to avoid performance degradation.
  • Phonological attention can mitigate diacritic collisions and tone ambiguities that standard fusion misses in tonal languages.
  • The ViTextCaps dataset supplies a benchmark revealing that over half the vocabulary in Vietnamese scene-text captions is diacritic-sensitive.
  • Learned spatial attention bias improves edge weighting within modality-specific subgraphs without needing cross-modal links.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result on harmful cross-modal edges may prompt re-examination of graph fusion designs across other vision-language tasks that currently connect modalities densely.
  • PhonoSTFG-style phonological attention could transfer to other tonal languages such as Thai or Mandarin where diacritics or tones similarly affect word identity.
  • Stronger phonological modeling inside fusion might lessen the downstream damage caused by typical OCR mistakes on diacritic-rich text.
  • The dataset statistics suggest future OCR systems for Vietnamese should prioritize diacritic preservation as a core accuracy metric.

Load-bearing premise

The topology finding that cross-modal edges are harmful will hold on other datasets, and phonological attention will deliver measurable gains that are not offset by OCR errors or by over-specialization to the ViTextCaps collection.

What would settle it

Repeating the topology analysis on a non-Vietnamese scene-text dataset where adding cross-modal edges raises BLEU or CIDEr scores, or showing that PhonoSTFG yields no improvement over HSTFG when supplied with ground-truth text rather than OCR output.

Figures

Figures reproduced from arXiv: 2604.27712 by Anh-Duc Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Nghia Hieu Nguyen, Nhi Ngoc-Yen Nguyen.

Figure 1: A representative ViTextCaps sample comprising one image and five human …
Figure 2: PCA biplot of ten per-image statistics computed over all 15,729 ViTextCaps im…
Figure 3: Hierarchical-cluster-ordered Pearson correlation heatmap of 10 per-image statis…
Figure 4: Collision-group size distribution in the Vietnamese caption vocabulary (…
Figure 5: Per-caption OCR-token coverage rate, stratified by text usage type (KDE ridges…
Figure 6: Joint distribution of OCR tokens in salience space (confidence…
Figure 7: HSTFG: Heterogeneous Scene-Text Fusion Graph. Full pipeline: a Faster R-CNN/VinVL visual encoder and a SwinTextSpotter OCR detector produce visual region embeddings (V nodes) and OCR text token embeddings (T nodes), both projected to d=768. Three configurable edge types (V→T, T→V, T→T) with independent WQ, WK, WV projections fuse all modalities across L=3 spatial graph attention layers, after which an MM…
Figure 8: PhonoSTFG: Phonologically-Enhanced T→T Fusion. Full pipeline: the OCR token embedding is replaced by a dual-stream architecture combining a visual OCR stream (v_vis, L2-normalised recognition and detection features) and a frozen PhoBERT linguistic stream (v_pho, ∼135M parameters); a learned gate g=σ(Wg[v_vis; v_pho]) fuses them per-dimension. The graph is restricted to T→T edges; V node embeddings bypass …
Figure 9: Phonological structure of a Vietnamese syllable, following the template … (see the sketch after this list)
Figure 10: Main results on the ViTextCaps test set. Bars show corpus scores for the …
Figure 11: Ablation analysis of the fusion architecture.
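
Figure 9's template can be made concrete. A minimal sketch of syllable decomposition into onset, rhyme, and tone, assuming the standard analysis of Vietnamese orthography (the paper's actual phonological feature extraction is not reproduced here): tone marks are the five Unicode combining characters, and onsets are matched longest-first.

```python
import unicodedata

# The five tone diacritics as combining characters; absence = level tone (ngang).
TONE_MARKS = {"\u0300": "huyền", "\u0301": "sắc", "\u0303": "ngã",
              "\u0309": "hỏi", "\u0323": "nặng"}
# Vietnamese onsets, ordered so longer clusters match before their prefixes.
ONSETS = ["ngh", "ch", "gh", "gi", "kh", "ng", "nh", "ph", "qu", "th", "tr",
          "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
          "t", "v", "x"]

def decompose(syllable: str) -> dict:
    """Split a written syllable into onset, rhyme, and tone. Vowel-quality
    diacritics (breve, circumflex, horn) stay in the rhyme; only tone marks
    are separated, mirroring the template in Figure 9."""
    nfd = unicodedata.normalize("NFD", syllable.lower())
    tone, base = "ngang", []
    for ch in nfd:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)
    base = unicodedata.normalize("NFC", "".join(base))
    onset = next((o for o in ONSETS if base.startswith(o)), "")
    return {"onset": onset, "rhyme": base[len(onset):], "tone": tone}

print(decompose("trường"))  # {'onset': 'tr', 'rhyme': 'ương', 'tone': 'huyền'}
```

The rhyme splits further into glide, nucleus, and coda under the same template; a phonological attention mechanism would consume these sub-syllabic fields as features.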
Original abstract

Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese: a tonal language where diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands linguistically informed multimodal fusion, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated by these insights, we propose HSTFG (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion. Building on this finding, we design PhonoSTFG (Phonological Scene-Text Fusion Graph), which specializes graph-level fusion for Vietnamese linguistic reasoning. To support evaluation, we introduce ViTextCaps, the first large-scale Vietnamese scene-text captioning dataset (15,729 images with 74,970 captions), with comprehensive linguistic analysis showing that 52.8% of the vocabulary is at risk of diacritic collision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces ViTextCaps, a new dataset of 15,729 Vietnamese scene-text images paired with 74,970 captions, along with linguistic analysis indicating that 52.8% of the vocabulary risks diacritic collision. It proposes the Heterogeneous Scene-Text Fusion Graph (HSTFG) as a general graph-based fusion framework incorporating learned spatial attention bias to combine visual features, OCR text, and linguistic knowledge. Topology analysis is used to conclude that cross-modal graph edges are harmful for scene-text fusion. This finding motivates the Phonological Scene-Text Fusion Graph (PhonoSTFG), which specializes the framework with phonological attention to address Vietnamese-specific issues such as tonal diacritics, OCR errors, and word-boundary ambiguity. The central claim is that linguistically informed multimodal fusion via phonological attention yields improved Vietnamese scene-text image captioning.

Significance. If the topology analysis and performance gains hold after addressing methodological controls, the work would provide a useful new benchmark dataset for non-English scene-text captioning and a graph fusion approach adapted to tonal languages. The emphasis on language-specific structural knowledge (phonology) and real-world OCR challenges in Vietnamese could inform similar adaptations for other low-resource or morphologically complex languages. The dataset scale and explicit linguistic analysis are positive contributions that could support future multilingual multimodal research.

major comments (1)
  1. [Topology Analysis] Topology Analysis section: the central claim that cross-modal graph edges are harmful (and thus motivate shifting from HSTFG to PhonoSTFG) rests on a comparison that removes those edges. Removing cross-modal edges necessarily reduces total edge count, average degree, and changes message-passing paths in the heterogeneous graph. No control experiment is described that holds edge count or connectivity fixed (e.g., by randomly pruning an equal number of intra-modal edges while preserving cross-modal ones, or by inserting neutral edges). Consequently, any performance change cannot be unambiguously attributed to the semantic harm of cross-modal fusion rather than generic sparsity or Laplacian effects. This is load-bearing for the argument that phonological attention is required as the remedy.
minor comments (3)
  1. [Abstract] Abstract: states that topology analysis shows cross-modal edges are harmful and that PhonoSTFG improves fusion, yet provides no quantitative results, baselines, error bars, ablation details, or specific metrics supporting these claims.
  2. [Dataset] Dataset section: the claim of 'comprehensive linguistic analysis' for the 52.8% diacritic-collision risk is presented without details on the computation method, vocabulary size, or how this statistic directly impacts captioning performance or OCR error rates.
  3. [Abstract] Notation and terminology: the acronyms HSTFG and PhonoSTFG are introduced in the abstract and title without immediate parenthetical expansion, which may reduce readability for readers unfamiliar with the framework.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The methodological concern regarding the topology analysis is valid and directly impacts the strength of our central claim. We address it point-by-point below and will revise the manuscript to incorporate the suggested control.

Point-by-point responses
  1. Referee: [Topology Analysis] Topology Analysis section: the central claim that cross-modal graph edges are harmful (and thus motivate shifting from HSTFG to PhonoSTFG) rests on a comparison that removes those edges. Removing cross-modal edges necessarily reduces total edge count, average degree, and changes message-passing paths in the heterogeneous graph. No control experiment is described that holds edge count or connectivity fixed (e.g., by randomly pruning an equal number of intra-modal edges while preserving cross-modal ones, or by inserting neutral edges). Consequently, any performance change cannot be unambiguously attributed to the semantic harm of cross-modal fusion rather than generic sparsity or Laplacian effects. This is load-bearing for the argument that phonological attention is required as the remedy.

    Authors: We agree that the existing comparison does not hold total edge count or average degree fixed, and therefore cannot unambiguously attribute performance differences to the semantic content of the cross-modal edges rather than generic effects of graph sparsity or altered message-passing paths. In the revised manuscript we will add an explicit control experiment that randomly prunes an equal number of intra-modal edges while retaining all cross-modal edges, thereby matching the edge count of the no-cross-modal variant. We will report the resulting captioning metrics, update the topology analysis section with the new results, and discuss whether the performance drop remains larger when cross-modal edges are removed. This control will strengthen (or, if necessary, qualify) the motivation for introducing phonological attention in PhonoSTFG. revision: yes
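
The proposed control is easy to state precisely. A minimal sketch, assuming a hypothetical (source, destination, type) edge representation in which T→T is the only intra-modal edge type the framework uses:

```python
import random

def edge_matched_control(edges, seed=0):
    """Keep all cross-modal edges but randomly prune intra-modal (T→T) edges so
    the total edge count equals that of the no-cross-modal ablation, isolating
    the semantic effect of cross-modal fusion from generic sparsity effects."""
    intra = [e for e in edges if e[2] == "t2t"]
    cross = [e for e in edges if e[2] in ("v2t", "t2v")]
    rng = random.Random(seed)
    # Drop as many intra-modal edges as the ablation removes cross-modal ones.
    kept_intra = rng.sample(intra, max(0, len(intra) - len(cross)))
    return kept_intra + cross
```

If captioning metrics drop more when cross-modal edges are removed than under this count-matched pruning, the harm is attributable to the edges' content rather than to generic sparsity.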

Circularity Check

0 steps flagged

No circularity: empirical topology analysis and dataset introduction are independent of self-referential inputs

Full rationale

The paper introduces a new dataset (ViTextCaps) and two graph frameworks (HSTFG, PhonoSTFG) whose central claims rest on empirical performance comparisons across graph topologies and linguistic properties of Vietnamese. The statement that cross-modal edges are harmful is presented as the outcome of topology analysis on trained models rather than any equation or fitted parameter that reduces the result to its own inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems are invoked to force the phonological attention mechanism; the specialization is motivated directly by diacritic and tonal characteristics described in the linguistic analysis of the new data. The evaluation chain is therefore anchored to external benchmarks, such as standard captioning metrics, rather than to self-referential inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

Claims rest on the assumption that graph neural networks can usefully model spatial and cross-modal relationships in scene-text images and that phonological features provide an independent signal for resolving Vietnamese diacritic ambiguities.

free parameters (1)
  • learned spatial attention bias
    Component of HSTFG, learned during training to modulate attention over graph edges.
axioms (2)
  • domain assumption Graph neural networks with attention can effectively integrate visual, OCR, and linguistic streams for caption generation.
    Core modeling choice underlying both HSTFG and PhonoSTFG.
  • domain assumption Phonological information is necessary and sufficient to mitigate diacritic collision and OCR errors in Vietnamese.
    Motivation for specializing the graph to phonological attention.
invented entities (2)
  • HSTFG no independent evidence
    purpose: Heterogeneous Scene-Text Fusion Graph with learned spatial attention bias for general multimodal fusion.
    Newly proposed general-purpose framework.
  • PhonoSTFG no independent evidence
    purpose: Phonological Scene-Text Fusion Graph that specializes fusion for Vietnamese linguistic reasoning.
    Newly proposed language-specific variant.

pith-pipeline@v0.9.0 · 5560 in / 1637 out tokens · 86140 ms · 2026-05-07T06:06:47.645094+00:00 · methodology
