pith. sign in

arxiv: 2605.30265 · v1 · pith:UPYJA4WKnew · submitted 2026-05-28 · 💻 cs.CV · cs.CL

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Pith reviewed 2026-06-29 08:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language modelsmultimodal fusionmodality substitutiondata curationcross-modal invariancecarrier sensitivityLoMo
0
0 comments X

The pith

Replacing selected text spans with their rendered images during training makes VLMs treat equivalent content the same across modalities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models show sharp performance drops when a textual question is replaced by its visual rendering, even though the meaning stays identical. The paper traces this carrier sensitivity to training data that assigns text and images asymmetric roles, such as queries versus references. Local Modality Substitution counters the bias by dynamically picking text spans inside prompts and turning them into images, creating interleaved sequences that supervise cross-modal invariance. Training with these sequences produces deeper fusion without changing the model architecture. Gains appear consistently across multiple base models and 13 benchmarks.

Core claim

Current training corpora organize text and images into distinct roles that cause VLMs to develop modality-specific preferences, so they fail to align representations of the same semantics when the carrier changes. LoMo corrects this by reformulating single-modality prompts into text-visual-text sequences: it selects local text spans and recasts them as rendered images while preserving exact semantics. The resulting supervision signal teaches representational invariance between carriers.

What carries the argument

Local Modality Substitution (LoMo), a data curation step that selects target text spans inside prompts and replaces them with rendered images to form interleaved multimodal training sequences.

If this is right

  • Consistent accuracy gains on 13 multimodal benchmarks.
  • Improvements of 2.67 points over standard supervised fine-tuning on LLaVA-OneVision-1.5-8B.
  • Improvements of 2.82 points over standard supervised fine-tuning on Qwen3.5-9B.
  • Deeper cross-modal fusion that is architecture-agnostic.
  • Robustness to modality substitution on both understanding and reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-substitution pattern could be applied when curating data for other modality pairs such as audio-text.
  • Future datasets might be collected with balanced carrier roles from the start instead of relying on post-hoc fixes.
  • Models trained this way may handle mixed-modality inputs more gracefully in open-ended settings.

Load-bearing premise

The performance drop when text is swapped for images is caused primarily by asymmetric roles in existing training data, and rendering those text spans as images preserves semantics without adding new artifacts or shifts.

What would settle it

After LoMo training, measure the accuracy gap between original text prompts and their image-rendered counterparts on held-out benchmarks; a near-zero gap would support the claim while a persistent gap would falsify it.

Figures

Figures reproduced from arXiv: 2605.30265 by Feng Han, Jiaqi Wang, Yibin Wang, Zheming Liang, Zhixiong Zhang.

Figure 1
Figure 1. Figure 1: Current Vision-Language Models exhibit carrier sensitivity driven by an underlying modality gap. (a) Carrier sensitivity across VLMs. Simply shifting identical semantic content from a text format to a visual format (rendering standard questions as images) causes consistent and significant accuracy drops across state-of-the-art models. (b) The physical manifestation of the modality gap. By measuring the pai… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of LoMo. LoMo reformulates a text-only instance into a text–image interleaved sequence through three stages. Structure-Aware Span Localization chunks the input in a formula￾aware manner and selects a semantically coherent middle span as the target for visualization. Visual Rendering converts the target span into an image via content-aware routing between LaTeX and standard text renderers. The imag… view at source ↗
Figure 3
Figure 3. Figure 3: LoMo yields consistent improvements over Standard SFT across two backbones ( [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LoMo consistently outperforms standard SFT across data scales on three metrics. (a) Average accuracy on 13 multimodal benchmarks (higher is better). (b) Mean MIR [12], measuring text-visual representation alignment (lower is better). (c) Pairwise cross-modal distance, defined as 1 − cos(h¯ text, h¯ img) between hidden states of text and its rendered image (lower is better). semantically equivalent text and… view at source ↗
read the original abstract

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Local Modality Substitution (LoMo), a data curation paradigm that dynamically selects text spans in single-modality prompts and renders them as images to produce interleaved multimodal sequences. This is intended to counteract 'carrier sensitivity'—performance degradation under text-to-image substitution—attributed to asymmetric text/image roles in existing corpora (captioning, VQA, OCR, interleaved data). The method is architecture-agnostic and is claimed to yield deeper cross-modal fusion, with reported gains of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B over standard SFT across 13 benchmarks.

Significance. If the reported gains can be shown to arise specifically from improved representational invariance rather than rendering artifacts or other data effects, LoMo would supply a lightweight, training-only intervention that strengthens VLM robustness without architectural modification. The architecture-agnostic framing and focus on local substitution are practical strengths that could influence data curation practices for multimodal training.

major comments (3)
  1. [Abstract] Abstract: The central empirical claims (consistent gains of 2.67 and 2.82 points, deeper fusion across 13 benchmarks) are stated without any description of experimental protocol, baseline implementations, training details, number of runs, or statistical tests. This absence directly affects verifiability of the performance improvements that support the invariance hypothesis.
  2. [Methods] Methods / LoMo description: The claim that dynamic rendering of selected text spans supplies 'clean supervision for cross-modal representational invariance' rests on the untested premise that rendered images preserve semantics without introducing new distributional shifts (font, layout, resolution, or visual artifacts). No controls, render-fidelity ablations, or direct measurements of representation alignment (e.g., before/after cosine similarity or probing) are referenced to rule out alternative explanations such as incidental robustness to rendered text.
  3. [Introduction] Introduction / motivation: The attribution of carrier sensitivity to 'inherent bias' and 'distinct and asymmetric roles' in prevalent datasets is presented without quantitative corpus analysis, examples of role asymmetry, or measurements showing how such asymmetry produces the observed substitution degradation. This leaves the causal premise for LoMo unsupported by evidence internal to the manuscript.
minor comments (1)
  1. [Abstract] The model name 'Qwen3.5-9B' should be verified against the exact release used; minor notation inconsistencies of this type reduce clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight areas where additional clarity and evidence would strengthen the manuscript. We address each major comment below and commit to revisions that improve verifiability and support for the core claims without altering the overall contribution.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims (consistent gains of 2.67 and 2.82 points, deeper fusion across 13 benchmarks) are stated without any description of experimental protocol, baseline implementations, training details, number of runs, or statistical tests. This absence directly affects verifiability of the performance improvements that support the invariance hypothesis.

    Authors: We agree that the abstract is concise and omits these details. The full experimental protocol, including model specifications, training hyperparameters, baseline comparisons, and evaluation across the 13 benchmarks, is provided in Section 4. To address verifiability concerns, we will revise the abstract to include a short clause noting the two base models evaluated and the breadth of benchmarks, while remaining within length limits. We will also add explicit statements on run counts and variance in the Experiments section. revision: partial

  2. Referee: [Methods] Methods / LoMo description: The claim that dynamic rendering of selected text spans supplies 'clean supervision for cross-modal representational invariance' rests on the untested premise that rendered images preserve semantics without introducing new distributional shifts (font, layout, resolution, or visual artifacts). No controls, render-fidelity ablations, or direct measurements of representation alignment (e.g., before/after cosine similarity or probing) are referenced to rule out alternative explanations such as incidental robustness to rendered text.

    Authors: This observation is correct; the current manuscript does not include render-fidelity ablations or direct alignment measurements. We will add a dedicated ablation subsection (and appendix) that varies rendering parameters such as font style, resolution, and layout, reports performance sensitivity, and includes cosine similarity probes between original text embeddings and rendered-image embeddings for matched spans. These additions will directly test whether gains arise from invariance rather than artifacts. revision: yes

  3. Referee: [Introduction] Introduction / motivation: The attribution of carrier sensitivity to 'inherent bias' and 'distinct and asymmetric roles' in prevalent datasets is presented without quantitative corpus analysis, examples of role asymmetry, or measurements showing how such asymmetry produces the observed substitution degradation. This leaves the causal premise for LoMo unsupported by evidence internal to the manuscript.

    Authors: We acknowledge that the introduction currently relies on qualitative description rather than quantitative corpus evidence. In the revised manuscript we will insert a short quantitative analysis, including role-distribution statistics across representative datasets (e.g., captioning, VQA, OCR) and concrete examples of substitution degradation on held-out prompts. This will provide internal support for the motivation before presenting LoMo. revision: yes

Circularity Check

0 steps flagged

No circularity: method is an independent data-curation procedure with no equations, fits, or self-citation chains.

full rationale

The paper describes a heuristic curation step (select text spans, render as images, interleave) motivated by an observed performance drop under modality substitution. No equations, parameters, or predictions are defined; the claimed gains are presented as empirical outcomes of the curation rather than re-derivations of prior fitted quantities. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that training corpora induce carrier-specific preferences and that local image rendering of text spans supplies the needed invariance signal.

axioms (1)
  • domain assumption Text and images occupy asymmetric roles in prevalent training corpora (captioning, VQA, OCR, interleaved web data), inducing distinct modality preferences.
    Explicitly stated in the abstract as the root cause of carrier sensitivity.

pith-pipeline@v0.9.1-grok · 5844 in / 1134 out tokens · 29312 ms · 2026-06-29T08:17:57.602699+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  5. [5]

    Glyph: Scaling context windows via visual-text compression

    Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800, 2025

  6. [6]

    Simplevqa: Multimodal factuality evaluation for multimodal large language models

    Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, et al. Simplevqa: Multimodal factuality evaluation for multimodal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4637–4646, 2025

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  8. [8]

    Mm-ifengine: Towards multimodal instruction following

    Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Mm-ifengine: Towards multimodal instruction following. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1099–1109, 2025

  9. [9]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models

    Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9062–9072, 2025

  10. [10]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pag...

  11. [11]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025. 10

  12. [12]

    Deciphering cross-modal alignment in large vision-language models with modality integration rate.arXiv preprint arXiv:2410.07167, 2024

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. Deciphering cross-modal alignment in large vision-language models with modality integration rate.arXiv preprint arXiv:2410.07167, 2024

  13. [13]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InInternational Conference on Learning Repre- sentations, volume 2025, pages 58791–58831, 2025

  14. [14]

    Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning.arXiv preprint arXiv:2405.01483, 2024

  15. [15]

    Multilingual pretraining for pixel language models

    Ilker Kesen, Jonas F Lotz, Ingo Ziegler, Phillip Rust, and Desmond Elliott. Multilingual pretraining for pixel language models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29582–29599, 2025

  16. [16]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisensch- los, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. InInternational Confer- ence on Machine Learning, pages 18893–18912. PMLR, 2023

  17. [17]

    Mitigating object hallucinations in large vision-language models through visual contrastive decoding

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872–13882, 2024

  18. [18]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed- bench: Benchmarking multimodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023

  19. [19]

    Text or pixels? it takes half: On the token efficiency of visual text inputs in multimodal llms.arXiv preprint arXiv:2510.18279, 2025

    Yanhong Li, Zixuan Lan, and Jiawei Zhou. Text or pixels? it takes half: On the token efficiency of visual text inputs in multimodal llms.arXiv preprint arXiv:2510.18279, 2025

  20. [20]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022

  21. [21]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  22. [22]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  23. [23]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

  24. [24]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language under- standing.arXiv preprint arXiv:2403.05525, 2024

  25. [25]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023

  26. [26]

    Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. Mmlongbench-doc: Benchmarking long-context document understanding with visualizations.Advances in Neural Information Processing Systems, 37:95963–96010, 2024

  27. [27]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 11

  28. [28]

    Teaching clip to count to ten

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF international conference on computer vision, pages 3170–3180, 2023

  29. [29]

    Runqi Qiao, Qiuna Tan, Guanting Dong, MinhuiWu MinhuiWu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

  30. [30]

    Zerobench: An impossible visual benchmark for contemporary large multimodal models.arXiv preprint arXiv:2502.09696, 2025

    Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, et al. Zerobench: An impossible visual benchmark for contemporary large multimodal models.arXiv preprint arXiv:2502.09696, 2025

  31. [31]

    Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models.arXiv preprint arXiv:2404.07983, 2024

    Simon Schrodi, David T Hoffmann, Max Argus, V olker Fischer, and Thomas Brox. Two effects, one trigger: On the modality gap, object bias, and information imbalance in contrastive vision-language models.arXiv preprint arXiv:2404.07983, 2024

  32. [32]

    Aligning large multimodal models with factually augmented rlhf

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

  33. [33]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms.Advances in Neural Information Processing Systems, 37:87310–87356, 2024

  34. [34]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9568–9578, 2024

  35. [35]

    Lever- aging visual tokens for extended text contexts in multi-modal learning.Advances in Neural Information Processing Systems, 37:14325–14348, 2024

    Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, and Mike Zheng Shou. Lever- aging visual tokens for extended text contexts in multi-modal learning.Advances in Neural Information Processing Systems, 37:14325–14348, 2024

  36. [36]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  37. [37]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark.arXiv preprint arXiv:2406.01574, 2024

  38. [38]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025

  39. [39]

    V?: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

  40. [40]

    Vision-centric token compression in large language model.arXiv preprint arXiv:2502.00791, 2025

    Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, and Jinhui Tang. Vision-centric token compression in large language model.arXiv preprint arXiv:2502.00791, 2025

  41. [41]

    Vision model pre-training on interleaved image-text data via latent compression learning.Advances in Neural Information Processing Systems, 37:23912– 23938, 2024

    Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, et al. Vision model pre-training on interleaved image-text data via latent compression learning.Advances in Neural Information Processing Systems, 37:23912– 23938, 2024. 12

  42. [42]

    Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy

    Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21744–21754, 2025

  43. [43]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  44. [44]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

  45. [45]

    Instruction-following evaluation for large language models, 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. 13 A Pure-Text Capability Analysis To verify the effect of LoMo on pure-text capabilities, we evaluate Standard SFT and LoMo on five pure-text benchmarks: MMLU-Pro [ 37], GSM8K [7], Hu...

  46. [46]

    The resulting rendered images are inserted back into multimodal SFT examples in place of the corresponding text spans

    Rendered images are retained at their native resolution based on the length of text. The resulting rendered images are inserted back into multimodal SFT examples in place of the corresponding text spans. B.3 Training Setup We fine-tune LLaV A-OneVision-1.5-8B-Base and Qwen3.5-9B-Base using LLaMA-Factory under a standard supervised fine-tuning regime. We a...