arxiv: 2606.20077 · v1 · pith:FCPUL2Z3new · submitted 2026-06-18 · 💻 cs.CV · cs.AI

The Hidden Evolution of Disguised Visual Context inside the VLM

Pith reviewed 2026-06-26 18:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · visual token integration · LLM internal representations · frequency characteristics · in-context prompts · layer-wise injection · multimodal alignment

The pith

Visual tokens enter LLMs as raw signals and are progressively reshaped according to the integration paradigm, with each paradigm capturing different frequency characteristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares treating visual tokens as in-context prompts versus injecting them into intermediate LLM layers, all under identical training conditions on single-image, multi-image, and video tasks. It shows that visual tokens begin as disguised visual context without linguistic structure and then evolve differently inside the model depending on the chosen integration method. Each method ends up emphasizing distinct frequency characteristics of the original visual signal. This internal reshaping controls which visual features become usable, how well they align with language representations, and which tasks each paradigm handles best. Attention patterns alone do not explain the performance gaps; the quality of the visual representations at each layer does.
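The difference between the two paradigms can be sketched with a toy single-head attention layer. This is an illustrative sketch only, not the paper's implementation: it omits causal masking, learned query/key/value projections, and the layer-specific connectors the paper describes, and all array sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention without masking.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))    # hypothetical text hidden states
visual = rng.normal(size=(9, d))  # hypothetical projected visual tokens

# In-context: visual tokens are part of the sequence, so they act as queries
# and get their own hidden states, which later layers can keep transforming.
seq = np.concatenate([visual, text], axis=0)
in_context_out = attention(seq, seq, seq)

# Layer-wise key/value injection: only text tokens are queries; the visual
# features are appended to this layer's keys and values.
kv = np.concatenate([visual, text], axis=0)
injected_out = attention(text, kv, kv)

print(in_context_out.shape, injected_out.shape)  # (14, 16) (5, 16)
```

In this one-layer, unmasked toy the text rows of both outputs coincide. The structural difference is the nine extra rows: only the in-context variant produces updated visual hidden states that the next layer can transform again.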

Core claim

Visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. This evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. Attention allocation alone is insufficient, and performance is driven by the quality of visual representations at each layer.

What carries the argument

The progressive reshaping of disguised visual context inside the LLM, where raw visual tokens acquire different frequency profiles according to whether they are added as in-context prompts or injected at intermediate layers.
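A frequency profile of this kind can be approximated as follows. This is a minimal sketch under stated assumptions: the visual tokens of one layer are arranged on a square grid, a 2D FFT is taken over the grid, and the log amplitude at the highest spatial frequency is compared with the log amplitude at zero frequency. The function name and the exact normalization are hypothetical, and the paper's own procedure may differ.

```python
import numpy as np

def relative_log_amplitude(tokens, grid_hw):
    """Log amplitude at the highest spatial frequency minus that at zero frequency.

    tokens: (H*W, C) visual token features of one layer, in row-major grid order.
    """
    h, w = grid_hw
    fmap = tokens.reshape(h, w, -1)
    # 2D FFT over the spatial axes; fftshift moves zero frequency to (h//2, w//2).
    spec = np.fft.fftshift(np.fft.fft2(fmap, axes=(0, 1)), axes=(0, 1))
    log_amp = np.log(np.abs(spec) + 1e-12).mean(axis=-1)  # average over channels
    # After fftshift, index (0, 0) holds the largest-magnitude frequency pair.
    return log_amp[0, 0] - log_amp[h // 2, w // 2]

h = w = 8
channels = 4
smooth = np.ones((h * w, channels))  # constant map: only zero frequency
ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
checker = np.repeat(((-1.0) ** (ii + jj)).reshape(-1, 1), channels, axis=1)

print(relative_log_amplitude(smooth, (h, w)) < 0)   # True
print(relative_log_amplitude(checker, (h, w)) > 0)  # True
```

Applied to the hidden states of every layer, a larger value would indicate relatively more high-frequency content (fine detail such as texture and edges) at that depth.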

If this is right

  • In-context and layer-wise injection each produce visual representations tuned to different frequency characteristics of the input signal.
  • The internal evolution of visual tokens controls alignment with the language space and determines task performance.
  • Performance gaps between paradigms arise from differences in representation quality at each layer rather than attention allocation alone.
  • The same training regime reveals that each paradigm has distinct strengths on single-image, multi-image, and video benchmarks.
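The attention side of the third bullet can be quantified as the share of attention a generated token places on visual positions at each layer. The sketch below assumes attention weights are already available as a `(layers, heads, keys)` array; the function name, array layout, and the uniform example weights are hypothetical and not taken from the paper.

```python
import numpy as np

def visual_attention_mass(attn, visual_mask):
    """Share of attention placed on visual key positions, per layer.

    attn: (layers, heads, keys) attention weights for one generated token;
          weights sum to 1 over the keys axis.
    visual_mask: (keys,) boolean array, True where the key is a visual token.
    """
    return attn[:, :, visual_mask].sum(axis=-1).mean(axis=-1)

layers, heads, keys = 4, 2, 10
visual_mask = np.zeros(keys, dtype=bool)
visual_mask[:4] = True  # hypothetical layout: first 4 keys are visual tokens

# Hypothetical weights: uniform attention, so the visual share is 4/10 everywhere.
attn = np.full((layers, heads, keys), 1.0 / keys)
mass = visual_attention_mass(attn, visual_mask)
print(mass)
```

A curve like this shows how much attention reaches the visual tokens, but not what those tokens encode, which is why the paper argues it cannot explain the performance gaps on its own.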

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Choosing the injection layer could be tuned to match the frequency demands of a target task.
  • The same layer-wise reshaping process may appear when other modalities such as audio are integrated into LLMs.
  • Architecture search could target specific representation qualities at particular depths rather than overall attention patterns.

Load-bearing premise

The two integration paradigms can be compared fairly when trained under identical conditions across single-image, multi-image, and video benchmarks.
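One way to make this premise checkable is to diff the training configurations of the two runs and confirm that only the integration setting differs. The sketch below is hypothetical throughout: the key names mirror the items the referee report asks for, and the values are placeholders rather than settings reported in the paper.

```python
_MISSING = object()

def differing_keys(cfg_a, cfg_b):
    """Return the sorted keys whose values differ between two config dicts."""
    keys = set(cfg_a) | set(cfg_b)
    return sorted(
        k for k in keys
        if cfg_a.get(k, _MISSING) != cfg_b.get(k, _MISSING)
    )

# Placeholder values for illustration only; none are taken from the paper.
in_context = {
    "integration": "in_context",
    "optimizer": "adamw",
    "lr_schedule": "cosine",
    "batch_size": 128,
    "total_steps": 10_000,
    "vision_encoder_frozen": True,
}
layer_wise = dict(in_context, integration="layer_wise")

print(differing_keys(in_context, layer_wise))  # ['integration']
```

Any additional key in the output would mark a confound that could explain benchmark differences independently of the architecture.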

What would settle it

Finding that the frequency characteristics of visual representations at corresponding layers remain identical between the two paradigms, or that performance differences disappear once representation quality is controlled for.

Figures

Figures reproduced from arXiv: 2606.20077 by Muhammad Awais, Sara Atito, Tony Alex, Wish Suharitdamrong.

Figure 1: Overview of the three VLM integration paradigms evaluated in this study. (a) IN-CT concatenates visual tokens with text tokens at the input layer, (b) LW-GC introduces visual features through gated cross-attention blocks, and (c) LW-AT injects visual features directly into the LLM's keys and values. 3 Model Architectures and Training Data. Integration Paradigms and Architectures. To conduct a fair systemati…
Figure 2: CKA heatmaps of image (a) and text (b) token representations across IN-CT layers. Higher values (yellow) indicate greater representational similarity between layers. Given that IN-CT consistently outperforms layer-wise injection, we hypothesize that visual tokens in IN-CT undergo a progressive transformation through the LLM layers, similar to how text tokens are refined. To test this, we utilize Centered…
Figure 3: CKA heatmaps of visual token representations across all three integration paradigms. (a) IN-CT exhibits smooth progressive evolution, while (b) LW-GC and (c) LW-AT show severe discontinuities. … understand how visual features behave under different integration paradigms, we compare these dynamics with LW-GC and LW-AT (…
Figure 4: Relative log amplitude of Fourier-transformed visual token representations across layers. (a) IN-CT, (b) LW-GC, (c) LW-AT, and (d) all three paradigms compared. … clearly captures high-frequency features, with the relative log amplitude increasing gradually across the early and middle layers, indicating a progressive shift toward fine-grained local details such as texture and edges. This ability to captur…
Figure 5: Visualization of 3D PCA projections of image and text token representations across layers. Top row: IN-CT, bottom row: LW-AT. Red points represent image tokens, blue points represent text tokens. Finding 2: IN-CT exhibits a structured frequency pattern, progressively shifting toward high-frequency features before consolidating back to low-frequency in the final layers, while LW-GC and LW-AT show erratic an…
Figure 6: Visualization of attention mass allocated to visual tokens during generation across layers. (a, b) On MMBench, attention to visual tokens decreases in deeper layers. (c, d) On ChartQA, attention remains high throughout, revealing task-dependent utilization patterns. … location of visual information utilized by the LLM at each layer across different integration paradigms. For IN-CT and LW-AT, we quantify vis…
Figure 7: Visualization of learned gate values across LW-GC layers for 700K and 1.6M training mixtures. IN-CT progressively captures high-frequency features (Finding 2), providing rich fine-grained details critical for OCR and temporal details essential for video understanding, while LW-AT retrieves representations biased toward low-frequency features. We further hypothesize that for OCR tasks, IN-CT also benefits f…
Figure 8: Left: Position ID scheme used in LW-AT compared to IN-CT. Visual tokens are assigned the same position IDs in the key-value sequence as they would occupy in IN-CT, preserving the relative positional relationship between visual and text tokens across both paradigms. Right: Architecture of LW-AT, where layer-specific connectors project visual features into the keys and values of each LLM layer, bypassing the…
Figure 9: Architecture overview of the hybrid IN-CT + LW-AT model compared to standalone IN-CT and LW-AT. IN-CT vision tokens are concatenated at the input, while LW-AT vision tokens are injected into the keys and values at every layer.
Figure 10: Token sequence and position ID assignment for IN-CT, LW-AT, and the hybrid IN-CT + LW-AT model. Each paradigm preserves its original position ID scheme independently within the hybrid architecture.
Figure 11: CKA heatmaps of visual (left) and text (right) token representations across layers for all three integration paradigms on ChartQA [52]. IN-CT exhibits smooth progressive evolution for both visual and text tokens, while LW-GC and LW-AT show severe discontinuities in their visual token representations.
Figure 12: Attention mass allocated to visual tokens during generation across layers for IN-CT and LW-AT on additional datasets. Attention patterns are task-dependent: general tasks show decreasing visual attention in deeper layers (MME [13] and CV-Bench 3D [65]), while text-centric tasks maintain high visual attention throughout (OCR-Bench [46] and DocVQA [53]).
Figure 13: 3D PCA projections of image (red) and text (blue) token representations at every layer for IN-CT. Visual tokens progressively converge toward the text token subspace across layers, with clear merging visible in the final layers.
Figure 14: 3D PCA projections of image (red) and text (blue) token representations at every layer for LW-AT. Unlike IN-CT, visual representations remain orthogonal to the text token subspace throughout all layers, with no convergence observed.
Figure 15: 3D PCA projections of image (red) and text (blue) token representations at every layer for LW-GC. Similar to LW-AT, visual and text representations occupy distinct subspaces throughout the network, consistent with the lack of progressive transformation in layer-wise injection paradigms.
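Several of the figures above are CKA heatmaps. A layer-by-layer heatmap of this kind can be computed roughly as follows. This is a minimal sketch assuming the linear variant of CKA from Kornblith et al. [28]; whether the paper uses the linear or a kernel variant is not stated in the text available here, and the random "layer" activations are purely hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_tokens, dim) representation matrices."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

def cka_heatmap(layer_reps):
    """Pairwise linear CKA between every pair of layers."""
    n = len(layer_reps)
    heat = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            heat[i, j] = linear_cka(layer_reps[i], layer_reps[j])
    return heat

rng = np.random.default_rng(0)
# Hypothetical "smoothly evolving" stack: each layer adds a small perturbation.
reps = [rng.normal(size=(64, 32))]
for _ in range(5):
    reps.append(reps[-1] + 0.3 * rng.normal(size=(64, 32)))

heat = cka_heatmap(reps)
print(np.round(heat[0], 2))
```

In a stack like this one, similarity to layer 0 decays gradually with depth, which corresponds to the smooth pattern the captions attribute to IN-CT. Abrupt drops between adjacent layers would correspond to the discontinuities reported for LW-GC and LW-AT.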
Original abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: visual tokens are either treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison of these architectural choices, and an understanding of how they affect visual information and its internal transformation to integrate with the LLM, remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper compares two VLM integration paradigms—in-context prompting of visual tokens versus layer-wise injection—under identical training conditions on single-image, multi-image, and video benchmarks. It claims that visual tokens enter the LLM as raw 'disguised visual context' lacking linguistic structure and undergo paradigm-dependent progressive reshaping that captures fundamentally different frequency characteristics of the visual signal; this internal evolution determines usable visual features, alignment with language space, and task performance. The work further asserts that attention allocation alone is insufficient and that performance is driven by the quality of visual representations at each layer.

Significance. If the results hold after verification of identical training conditions, the work would provide a useful empirical dissection of how architectural choices shape internal visual representations in VLMs. The multi-benchmark controlled comparison and the emphasis on layer-wise representation quality (rather than attention alone) could inform future VLM design. However, the abstract reports no equations, fitted parameters, or statistical controls, which limits immediate assessment of the frequency-characteristic claims.

major comments (1)
  1. [Abstract] Abstract (and experimental setup section): The central claim that paradigm-dependent reshaping and frequency differences drive performance gaps requires that the only systematic difference between in-context and layer-wise setups is the integration architecture. The manuscript states the paradigms were evaluated 'under identical training conditions,' yet supplies no explicit verification (optimizer, learning-rate schedule, batch construction, number of steps, or visual-encoder update policy). This assumption is load-bearing; any mismatch would allow training artifacts to explain the reported evolution and benchmark differences.
minor comments (2)
  1. [Abstract] Abstract: The term 'disguised visual context' is introduced without definition or citation, which reduces clarity for readers.
  2. [Abstract] Abstract: No dataset names, sizes, or error-bar reporting are mentioned, making it difficult to gauge the scale and statistical reliability of the frequency-characteristic observations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback emphasizing the need for transparent verification of our controlled experimental setup. We address the single major comment below.

  1. Referee: [Abstract] Abstract (and experimental setup section): The central claim that paradigm-dependent reshaping and frequency differences drive performance gaps requires that the only systematic difference between in-context and layer-wise setups is the integration architecture. The manuscript states the paradigms were evaluated 'under identical training conditions,' yet supplies no explicit verification (optimizer, learning-rate schedule, batch construction, number of steps, or visual-encoder update policy). This assumption is load-bearing; any mismatch would allow training artifacts to explain the reported evolution and benchmark differences.

    Authors: We agree that explicit verification is essential to substantiate the claim that architectural integration is the sole systematic difference. The original manuscript asserted identical training conditions but did not enumerate the specific hyperparameters. In revision we will add a dedicated paragraph (or table) in the experimental setup section specifying the optimizer, learning-rate schedule, batch construction, total steps, and visual-encoder update policy for both paradigms, thereby confirming the controlled comparison and removing any ambiguity about training artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations from controlled experiments

The paper presents an empirical comparison of two VLM integration paradigms evaluated under identical training conditions, with the reported 'hidden evolution' and frequency characteristics framed as direct observations from internal representation analysis rather than quantities derived from fitted parameters, self-definitions, or self-citation chains. No equations, ansatzes, or predictions appear in the provided text that reduce claims to inputs by construction; the central assertions rest on experimental isolation of architectural differences, which is self-contained against external benchmarks and does not invoke load-bearing self-citations or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; all claims rest on the unstated assumption that the training conditions were truly identical and that frequency analysis was performed without post-hoc selection.

pith-pipeline@v0.9.1-grok · 5740 in / 1117 out tokens · 28769 ms · 2026-06-26T18:06:54.584881+00:00 · methodology

Reference graph

Works this paper leans on

85 extracted references · 30 canonical work pages · 16 internal anchors

  1. [1]

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022)

  3. [3]

    Alex, T., Suharitdamrong, W., Atito, S., Mustafa, A., Jackson, P.J.B., Razzak, I., Awais, M.: Pal: Probing audio encoders via llms – audio information transfer into llms (2026), https://arxiv.org/abs/2506.10423v1

  4. [4]

    An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)

  5. [5]

    Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., et al.: Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)

  6. [6]

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  7. [7]

    Basu, S., Grayson, M., Morrison, C., Nushi, B., Feizi, S., Massiceti, D.: Understanding information storage and transfer in multi-modal large language models. Advances in Neural Information Processing Systems 37, 7400–7426 (2024)

  8. [8]

    Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In: European Conference on Computer Vision. pp. 19–35. Springer (2024)

  9. [9]

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information processing systems 36, 49250–49267 (2023)

  10. [10]

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025)

  11. [11]

    Diao, H., Li, X., Cui, Y., Wang, Y., Deng, H., Pan, T., Wang, W., Lu, H., Wang, X.: Evev2: Improved baselines for encoder-free vision-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21014–21025 (2025)

  12. [12]

    Eichenberg, C., Black, S., Weinbach, S., Parcalabescu, L., Frank, A.: Magma – multimodal augmentation of generative models through adapter-based finetuning. In: Findings of the association for computational linguistics: EMNLP 2022. pp. 2416–2428 (2022)

  13. [13]

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)

  14. [14]

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24108–24118 (2025)

  15. [15]

    Fu, S., Bonnen, T., Guillory, D., Darrell, T.: Hidden in plain sight: Vlms overlook their visual representations. arXiv preprint arXiv:2506.08008 (2025)

  16. [16]

    Ghosh, S., Kong, Z., Kumar, S., Sakshi, S., Kim, J., Ping, W., Valle, R., Manocha, D., Catanzaro, B.: Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. arXiv preprint arXiv:2503.03983 (2025)

  17. [17]

    Han, J., Gong, K., Zhang, Y., Wang, J., Zhang, K., Lin, D., Qiao, Y., Gao, P., Yue, X.: Onellm: One framework to align all modalities with language. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26584–26595 (2024)

  18. [18]

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: Cogagent: A visual language model for gui agents. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14281–14290 (2024)

  19. [19]

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)

  20. [20]

    Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., Carreira, J.: Perceiver: General perception with iterative attention. In: International conference on machine learning. pp. 4651–4664. PMLR (2021)

  21. [21]

    Jiang, N., Kachinthaya, A., Petryk, S., Gandelsman, Y.: Interpreting and editing vision-language representations to mitigate hallucinations. arXiv preprint arXiv:2410.02762 (2024)

  22. [22]

    Jiang, Z., Chen, J., Zhu, B., Luo, T., Shen, Y., Yang, X.: Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25004–25014 (2025)

  23. [23]

    Kaduri, O., Bagon, S., Dekel, T.: What’s in the image? a deep-dive into the vision of vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14549–14558 (2025)

  24. [24]

    Kafle, K., Price, B., Cohen, S., Kanan, C.: Dvqa: Understanding data visualizations via question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5648–5656 (2018)

  25. [25]

    Kang, S., Kim, J., Kim, J., Hwang, S.J.: See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321 (2025)

  26. [26]

    Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: European conference on computer vision. pp. 235–251. Springer (2016)

  27. [27]

    Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: European Conference on Computer Vision. pp. 498–517. Springer (2022)

  28. [28]

    Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International conference on machine learning. pp. 3519–

  29. [29]

    Laurençon, H., Saulnier, L., Tronchon, L., Bekman, S., Singh, A., Lozhkov, A., Wang, T., Karamcheti, S., Rush, A., Kiela, D., et al.: Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 36, 71683–71702 (2023)

  30. [30]

    Laurençon, H., Tronchon, L., Cord, M., Sanh, V.: What matters when building vision-language models? Advances in Neural Information Processing Systems 37, 87874–87907 (2024)

  31. [31]

    Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J.A., Yang, J., Li, C., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  32. [32]

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  33. [33]

    Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., Shan, Y.: Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125 (2023)

  34. [34]

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

  35. [35]

    Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision. pp. 323–340. Springer (2024)

  36. [36]

    Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y.: Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems 35, 17612–17625 (2022)

  37. [37]

    Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  38. [38]

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)

  39. [39]

    Lin, J., Chen, H., Fan, Y., Fan, Y., Jin, X., Su, H., Fu, J., Shen, X.: Multi-layer visual feature fusion in multimodal llms: Methods, analysis, and best practices. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4156–4166 (2025)

  40. [40]

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  41. [41]

    Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., et al.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023)

  42. [42]

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  43. [43]

    Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

  44. [44]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)

  45. [45]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)

  46. [46]

    Science China Information Sciences67(12), 220102 (2024)

    Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12), 220102 (2024)

  47. [47]

    Advances in neural information processing systems35, 2507– 2521 (2022)

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems35, 2507– 2521 (2022)

  48. [48]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Luo, G., Yang, X., Dou, W., Wang, Z., Liu, J., Dai, J., Qiao, Y., Zhu, X.: Mono- internvl: Pushing the boundaries of monolithic multimodal large language models with endogenous visual pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24960–24971 (2025)

  49. [49]

    Ma, F., Zhou, Y., Zhang, Z., Yan, S., Li, H., He, Z., Wu, S., Rao, F., Zhang, Y., Sun, X.: Ee-mllm: A data-efficient and compute-efficient multimodal large language model. arXiv preprint arXiv:2408.11795 (2024)

  50. [50]

    Mañas, O., Lopez, P.R., Ahmadi, S., Nematzadeh, A., Goyal, Y., Agrawal, A.: Mapl: Parameter-efficient adaptation of unimodal pre-trained models for vision-language few-shot prompting. In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics...

  51. [51]

    Mangalam, K., Akshulakov, R., Malik, J.: Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems 36, 46212–46244 (2023)

  52. [52]

    Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In: Findings of the association for computational linguistics: ACL 2022. pp. 2263–2279 (2022)

  53. [53]

    Mathew, M., Karatzas, D., Jawahar, C.: Docvqa: A dataset for vqa on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2200–2209 (2021)

  54. [54]

    McIntosh-Smith, S., Alam, S., Woods, C.: Isambard-ai: a leadership-class supercomputer optimised specifically for artificial intelligence. In: Proceedings of the Cray User Group, pp. 44–54 (2024)

  55. [55]

    Meng, L., Yang, J., Tian, R., Dai, X., Wu, Z., Gao, J., Jiang, Y.G.: Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. Advances in Neural Information Processing Systems 37, 23464–23487 (2024)

  56. [56]

    Merullo, J., Castricato, L., Eickhoff, C., Pavlick, E.: Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162 (2022)

  57. [57]

    Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: Ocr-vqa: Visual question answering by reading text in images. In: 2019 international conference on document analysis and recognition (ICDAR). pp. 947–952. IEEE (2019)

  58. [58]

    Neo, C., Ong, L., Torr, P., Geva, M., Krueger, D., Barez, F.: Towards interpreting visual information processing in vision-language models. arXiv preprint arXiv:2410.07149 (2024)

  59. [59]

    Park, N., Kim, S.: How do vision transformers work? arXiv preprint arXiv:2202.06709 (2022)

  60. [60]

    Park, N., Kim, W., Heo, B., Kim, T., Yun, S.: What do self-supervised vision transformers learn? arXiv preprint arXiv:2305.00729 (2023)

  61. [61]

    Shen, X., Xiong, Y., Zhao, C., Wu, L., Chen, J., Zhu, C., Liu, Z., Xiao, F., Varadarajan, B., Bordes, F., et al.: Longvu: Spatiotemporal adaptive compression for long video-language understanding. arXiv preprint arXiv:2410.17434 (2024)

  62. [62]

    Shukor, M., Cord, M.: Implicit multimodal alignment: On the generalization of frozen llms to multimodal inputs. Advances in Neural Information Processing Systems 37, 130848–130886 (2024)

  63. [63]

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)

  64. [64]

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  65. [65]

    Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)

  66. [66]

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

  67. [67]

    Tsimpoukelli, M., Menick, J.L., Cabi, S., Eslami, S., Vinyals, O., Hill, F.: Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems 34, 200–212 (2021)

  68. [68]

    Wang, H., Ye, Y., Li, B., Nie, Y., Lu, J., Tang, J., Wang, Y., Huang, C.: Vision as lora. arXiv preprint arXiv:2503.20680 (2025)

  69. [69]

    Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., XiXuan, S., et al.: Cogvlm: Visual expert for pretrained language models. Advances in Neural Information Processing Systems 37, 121475–121499 (2024)

  70. [70]

    Wu, H., Li, D., Chen, B., Li, J.: Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems 37, 28828–28857 (2024)

  71. [71]

    xAI: Grok-1.5 vision preview. https://x.ai/news/grok-1.5v (April 2024), accessed: 2026-03-08

  72. [72]

    Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9777–9786 (2021)

  73. [73]

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2025)

  74. [74]

    Ye, J., Xu, H., Liu, H., Hu, A., Yan, M., Qian, Q., Zhang, J., Huang, F., Zhou, J.: mplug-owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840 (2024)

  75. [75]

    Yu, Z., Lee, Y.J.: How multimodal llms solve image tasks: A lens on visual grounding, task reasoning, and answer decoding. arXiv preprint arXiv:2508.20279 (2025)

  76. [76]

    Yue, T., Guo, L., Tang, Y., Zhao, Z., Zhu, X., Huang, H., Liu, J.: Lavi: Efficient large vision-language models via internal feature modulation. arXiv preprint arXiv:2506.16691 (2025)

  77. [77]

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024)

  78. [78]

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., Choi, Y.: Hellaswag: Can a machine really finish your sentence? In: Proceedings of the 57th annual meeting of the association for computational linguistics. pp. 4791–4800 (2019)

  79. [79]

    Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  80. [80]

    Zhang, K., Shen, Y., Li, B., Liu, Z.: Large multi-modal models can interpret features in large multi-modal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3650–3661 (2025)

Showing first 80 references.