arxiv: 2606.24118 · v1 · pith:4FBPJ5J2new · submitted 2026-06-23 · 💻 cs.CV

An LMM for Precisely Grounding Elements in Documents

Pith reviewed 2026-06-26 01:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords document element grounding · large multimodal models · synthetic data · visual grounding · reinforcement learning · document VQA · spatial grounding · element localization

The pith

PreciseDoc is an LMM that grounds document elements precisely by training on synthetic hand-filled documents with coordinate metadata and using reinforcement learning to jointly supervise grounding and reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PreciseDoc is presented as an LMM tailored for precise element grounding in text-rich documents, where existing models often fail to locate critical elements accurately. The approach relies on two pipelines that generate large amounts of synthetic documents with exact coordinate metadata, including simulated hand-filled forms with camera distortions. Training incorporates a reinforcement learning paradigm that supervises grounding and reasoning together to make better use of located evidence. Evaluations on benchmarks show improvements in both grounding precision and downstream document understanding. This matters for applications that require reliable localization, such as document understanding, deep research, and document error detection.

Core claim

PreciseDoc is an LMM specifically designed for precise element grounding in documents. It uses two synthetic data generation pipelines to create high-quality training examples with fine-grained coordinates, including hand-filled documents with camera effects. A training paradigm jointly supervises grounding and reasoning with reinforcement learning. This results in capabilities like locating personal information from CVs and better performance on document spatial grounding and understanding benchmarks.

What carries the argument

Synthetic data generation pipelines that produce documents paired with fine-grained coordinate metadata, combined with a reinforcement learning training paradigm that jointly optimizes grounding and reasoning.
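
One way to picture how coordinate metadata can stay aligned with an image after a simulated camera effect is to push each box through the same geometric transform that is applied to the pixels. The sketch below is illustrative only: the summary does not say how the paper's pipelines handle this, and the names `warp_point`, `warp_box`, `field`, and the matrix `H` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Matrix = List[List[float]]               # 3x3 homography, row-major


def warp_point(h: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply a 3x3 homography to one point using homogeneous coordinates."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    new_x = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    new_y = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return new_x, new_y


def warp_box(h: Matrix, box: Box) -> Box:
    """Warp all four corners, then take the enclosing axis-aligned box."""
    x0, y0, x1, y1 = box
    corners = [warp_point(h, x, y)
               for x, y in ((x0, y0), (x1, y0), (x1, y1), (x0, y1))]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical metadata record for one rendered form field.
field: Dict[str, object] = {"text": "Jane Doe",
                            "box": (100.0, 200.0, 300.0, 240.0)}

# Illustrative distortion (slight shear plus translation); not from the paper.
H: Matrix = [[1.0, 0.1, 5.0],
             [0.0, 1.0, 10.0],
             [0.0, 0.0, 1.0]]

# The warped record keeps the text and carries the transformed coordinates.
warped = dict(field, box=warp_box(H, field["box"]))
```

Taking the enclosing axis-aligned box is a simplification; under strong perspective distortion a four-point polygon would describe the element more tightly.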

If this is right

  • The model develops real-world functions such as locating personal information from CVs beyond single-text localization.
  • Joint reinforcement learning supervision increases the contribution of grounded evidence to overall reasoning.
  • The approach enables further optimization specifically for Document VQA tasks.
  • Comprehensive benchmark evaluations demonstrate advantages in both document spatial grounding and document understanding.
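
To make the idea of jointly supervising grounding and reasoning concrete, the sketch below combines a box-overlap score with an answer-match score into a single scalar reward. This is an assumed formulation for illustration, not the reward used in the paper: the weighted sum, the `grounding_weight` parameter, the exact-match answer check, and the function names are all hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_reward(pred_box: Box, gold_box: Box,
                 pred_answer: str, gold_answer: str,
                 grounding_weight: float = 0.5) -> float:
    """Assumed reward: weighted sum of grounding overlap and answer match."""
    if not 0.0 <= grounding_weight <= 1.0:
        raise ValueError("grounding_weight must be in [0, 1]")
    grounding_score = iou(pred_box, gold_box)
    # Case-insensitive exact match stands in for a real answer metric.
    answer_score = float(pred_answer.strip().lower()
                         == gold_answer.strip().lower())
    return (grounding_weight * grounding_score
            + (1.0 - grounding_weight) * answer_score)


# A rollout with a half-shifted box and a wrong answer earns partial credit
# from grounding alone.
partial = joint_reward((0, 0, 10, 10), (5, 0, 15, 10), "Berlin", "Paris")
```

Under such a reward, a correct answer backed by a badly placed box scores lower than the same answer with a well-placed box, which is one plausible way to tie the answer to the located evidence.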

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Accurate grounding could directly support document error detection by supplying reliable element locations for verification.
  • Synthetic data pipelines may address scarcity issues in training data for other multimodal grounding problems.
  • Joint grounding-reasoning training might reduce reliance on ungrounded inferences in document-based chains of reasoning.
  • The method could integrate into production document processing pipelines where precise spatial references matter.

Load-bearing premise

The synthetic data generation pipelines produce document images whose distribution is close enough to real-world cases that benchmark gains transfer to actual use.

What would settle it

Evaluating the model on a large set of real-world scanned documents with manual ground-truth coordinates and comparing precision to existing LMMs; if no improvement or degradation occurs, the claim fails.
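
A comparison of this kind could be run as a paired bootstrap over per-document hit flags, where a hit means the predicted location was judged correct against the manual ground truth. The sketch below assumes those flags have already been computed; the function name `paired_bootstrap`, the 95% interval, and the resample count are illustrative choices, not details from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap(hits_a: List[int], hits_b: List[int],
                     n_resamples: int = 2000,
                     seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed precision gap, 2.5% bound, 97.5% bound) for A minus B.

    hits_a[i] and hits_b[i] are 1 if the respective model grounded document i
    correctly, else 0. Both lists must cover the same documents in order.
    """
    if len(hits_a) != len(hits_b) or not hits_a:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    observed = sum(diffs) / n
    # Resample documents with replacement and recompute the gap each time.
    samples = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    low = samples[int(0.025 * n_resamples)]
    high = samples[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, low, high


# Toy flags for two models on the same eight documents (made-up values).
model_a = [1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 0, 0, 1, 0, 1, 0, 1]
gap, lower, upper = paired_bootstrap(model_a, model_b)
```

If the interval includes zero or lies below it, the data would not support a claim of improved precision on that document set.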

Figures

Figures reproduced from arXiv: 2606.24118 by Chuangxin Zhao, Ji Qi, Juanzi Li, Kai Sun, Lei Hou, Yijian Lu.

Figure 1: Illustration of Section 3 and Section 4. The upper part of the figure presents the two

Original abstract

Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data by two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model develops more real-world functions beyond straightforward localization of single text, such as locating personal information from CVs. Furthermore, we introduce a training paradigm for visual grounded reasoning where the grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces PreciseDoc, an LMM for precise element grounding in document images. It constructs training data via two synthetic pipelines (hand-filled documents plus camera effects) that produce paired fine-grained coordinate metadata, and proposes a reinforcement-learning training paradigm that jointly supervises grounding and reasoning. The central claim is that these data and methods yield advantages in document spatial grounding and document understanding, as shown by comprehensive evaluation on various benchmarks.

Significance. If the reported benchmark gains are real and the synthetic data generalizes, the work would address a recognized weakness of current LMMs on text-rich documents and supply a scalable route to high-precision localization metadata. The joint RL supervision of grounding and reasoning is a plausible direction for improving evidence-based reasoning.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.
  2. [Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract. We will revise the abstract to better support the central claims with quantitative highlights from the full manuscript while preserving its concise nature. The full paper already contains the requested evaluations, baselines, and ablations in the experiments section.

  1. Referee: [Abstract] Abstract: the assertion that 'a comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods' supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis. Without these, the central empirical claim cannot be evaluated.

    Authors: We agree the abstract would be strengthened by including key quantitative results. The manuscript body reports comprehensive evaluations across multiple benchmarks, including specific metrics (e.g., grounding precision, VQA accuracy), baseline comparisons, ablations on data pipelines and RL training, and error analyses. We will revise the abstract to incorporate representative numbers, baseline references, and mention of ablations to make the empirical claim self-contained at the summary level.
    revision: yes

  2. Referee: [Abstract] Abstract (synthetic data pipelines): no distribution-distance metric, realism score, or ablation on real vs. synthetic test splits is reported to verify that the hand-filled documents plus camera effects reproduce the statistics of the real document images appearing in the held-out benchmarks. This is load-bearing for the transfer claim.

    Authors: The synthetic pipelines are explicitly designed to produce realistic variations via hand-filled content and camera effects to bridge the domain gap. While the current manuscript does not report explicit distribution-distance metrics (e.g., FID) or dedicated real-vs-synthetic ablations, the transfer is validated empirically through strong performance gains on held-out real-world benchmarks. We will add a brief discussion in the revised manuscript on the design rationale for realism and note the empirical evidence of generalization; a full quantitative distribution analysis would require additional experiments beyond the current scope.
    revision: partial
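
For readers unfamiliar with the distribution-distance metric mentioned in this exchange, the Fréchet distance between two Gaussians fitted to image features is d² = ||μ_r − μ_s||² + Tr(Σ_r + Σ_s − 2(Σ_r Σ_s)^(1/2)). The sketch below is a simplified version that assumes diagonal covariances, in which case the trace term reduces to a per-dimension sum. The feature vectors are toy numbers; a real FID computation would use features from a pretrained image network and full covariance matrices, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_variance(
    features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance of a feature set."""
    if not features:
        raise ValueError("feature set must not be empty")
    n = len(features)
    dim = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in features) / n
                 for j in range(dim)]
    return means, variances


def diagonal_frechet_distance(real: Sequence[Sequence[float]],
                              synthetic: Sequence[Sequence[float]]) -> float:
    """Squared Fréchet distance under a diagonal-covariance assumption."""
    mu_r, var_r = mean_and_variance(real)
    mu_s, var_s = mean_and_variance(synthetic)
    if len(mu_r) != len(mu_s):
        raise ValueError("feature dimensions differ")
    return sum(
        (mr - ms) ** 2 + vr + vs - 2.0 * math.sqrt(vr * vs)
        for mr, ms, vr, vs in zip(mu_r, mu_s, var_r, var_s)
    )


# Toy 2-D "features" for real and synthetic document images (made-up values).
real_features = [[0.0, 0.0], [2.0, 2.0]]
synthetic_features = [[1.0, 1.0], [3.0, 3.0]]
distance = diagonal_frechet_distance(real_features, synthetic_features)
```

A smaller value indicates that the two feature distributions are closer; it does not by itself show that models trained on the synthetic set will transfer.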

Circularity Check

0 steps flagged

No circularity: empirical modeling contribution with independent evaluation

The paper describes construction of synthetic training data via two pipelines, a joint grounding+reasoning training paradigm using reinforcement learning, and benchmark evaluations. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described content. Claims rest on experimental results rather than any derivation chain that reduces to its own inputs by construction. This is a standard empirical LMM paper. A reader's score of 2 would be consistent with tolerating minor self-citation, but a score of 0 is appropriate here because no circular patterns are exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that synthetic documents with camera effects match real distributions, and that joint RL supervision of grounding and reasoning yields better evidence use than separate training. No free parameters or invented physical entities are described in the abstract.

axioms (2)
  • domain assumption: Synthetic hand-filled documents with camera effects approximate the distribution of real document images
    Invoked to justify mass production of training data for real-world functions
  • domain assumption: Joint reinforcement learning of grounding and reasoning improves the contribution of grounded evidence
    Core of the proposed training paradigm

pith-pipeline@v0.9.1-grok · 5720 in / 1258 out tokens · 21453 ms · 2026-06-26T01:49:50.688349+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning. arXiv preprint arXiv:2505.20272, 2025

  2. [2]

    Revisiting referring expression comprehension evaluation in the era of large multimodal models

    Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 513–524, 2025

  3. [3]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024

  4. [4]

    PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr-vl-1.5: Towards a multi-task 0.9b vlm for robust in-the-wild document parsing, 2026. URL https://arxiv.org/abs/2601.21957

  5. [5]

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. Advances in Neural Information Processing Systems, 37:42566–42592, 2024

  6. [6]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025

  7. [7]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024

  8. [8]

    Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding

    Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. arXiv preprint arXiv:2308.11592, 2023

  9. [9]

    Boundingdocs: a unified dataset for document question answering with spatial annotations

    Simone Giovannini, Fabio Coppini, Andrea Gemelli, and Simone Marinai. Boundingdocs: a unified dataset for document question answering with spatial annotations. International Journal on Document Analysis and Recognition (IJDAR), pages 1–16, 2025

  10. [10]

    Nist special database 19 handprinted forms and characters database

    Patrick Grother. Nist special database 19 handprinted forms and characters database. 1995. URL https://api.semanticscholar.org/CorpusID:59785963

  11. [11]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025

  13. [13]

    mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3096–3120, 2024

  14. [14]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  15. [15]

    H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109

  16. [16]

    Towards visual text grounding of multimodal large language model

    Ming Li, Ruiyi Zhang, Jian Chen, Chenguang Wang, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, and Tong Sun. Towards visual text grounding of multimodal large language model. arXiv preprint arXiv:2504.04974, 2025

  17. [17]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26763–26773, 2024

  18. [18]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

  19. [19]

    Textmonkey: An ocr-free large multimodal model for understanding document

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  20. [20]

    Vision-reasoner: Unified visual perception and reasoning via reinforcement learning

    Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Vision-reasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025

  21. [21]

    Chain-of-spot: Interactive reasoning improves large vision-language models

    Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024

  22. [22]

    Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning

    Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. Kam-cot: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 18798–18806, 2024

  23. [23]

    Towards understanding visual grounding in visual language models

    Georgios Pantazopoulos and Eda B Özyiğit. Towards understanding visual grounding in visual language models. arXiv preprint arXiv:2509.10345, 2025

  24. [24]

    Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes

    Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, and Yong Man Ro. Dip-r1: Deep inspection and perception with rl looking through and understanding complex scenes. arXiv preprint arXiv:2505.23179, 2025

  25. [25]

    Cogcom: Train large vision-language models diving into details through chain of manipulations

    Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, et al. Cogcom: Train large vision-language models diving into details through chain of manipulations. 2024

  26. [26]

    Referring expression comprehension: A survey of methods and datasets

    Yanyuan Qiao, Chaorui Deng, and Qi Wu. Referring expression comprehension: A survey of methods and datasets. IEEE Transactions on Multimedia, 23:4426–4440, 2020

  27. [27]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  28. [28]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024

  29. [29]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  30. [30]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  31. [31]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  32. [32]

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. Mineru: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024

  33. [33]

    Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology

    Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025

  34. [34]

    VGR: Visual Grounded Reasoning

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025

  35. [35]

    Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023

  36. [36]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023

  37. [37]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European conference on computer vision, pages 69–85. Springer, 2016

  38. [38]

    Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer

    Wenhan Yu, Wang Chen, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, and Jizhou Huang. Bbox docvqa: A large scale bounding box grounded dataset for enhancing reasoning in document visual question answer. arXiv preprint arXiv:2511.15090, 2025

  39. [39]

    Referring expression comprehension with semantic visual relationship and word mapping

    Chao Zhang, Weiming Li, Wanli Ouyang, Qiang Wang, Woo-Shik Kim, and Sunghoon Hong. Referring expression comprehension with semantic visual relationship and word mapping. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1258–1266, 2019

  40. [40]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  41. [41]

    Dogr: Towards versatile visual document grounding and referring

    Yinan Zhou, Yuxin Chen, Haokun Lin, Yichen Wu, Shuyu Yang, Zhongang Qi, Chen Ma, and Li Zhu. Dogr: Towards versatile visual document grounding and referring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3596–3606, 2025

  42. [42]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

A Training Details

The base model we use is GLM-4.6V-9B [12]. The whole training process consist...