pith. machine review for the scientific record.

arxiv: 2604.05117 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

Watch Before You Answer: Learning from Visually Grounded Post-Training

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords vision-language models · video understanding · post-training · data curation · visual grounding · reinforcement learning · benchmark quality · multimodal models
0 comments

The pith

Filtering post-training data to only visually grounded questions improves VLM video understanding by up to 6.2 points while using only 69.1 percent of the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In many video understanding benchmarks and post-training datasets, 40-60 percent of questions can be answered from text alone, without any visual input. The paper shows that restricting post-training to the visually grounded subset of questions addresses this issue: combined with RL-based post-training algorithms, the curated subset delivers higher performance than the full original dataset while using less data. It also outperforms several more complex post-training methods, indicating that ensuring questions genuinely require visual grounding is a major lever for advancing VLM capabilities.

Core claim

Commonly reported long-video understanding benchmarks contain 40-60 percent of questions that can be answered from text cues alone, and similar issues pervade post-training datasets. VidGround selects only the genuinely visually grounded questions, those free of linguistic shortcuts, for post-training; paired with RL-based algorithms, it improves performance by up to 6.2 points while using only 69.1 percent of the original post-training data. Data curation with a simple post-training algorithm outperforms several more complex post-training techniques, showing that data quality is a major bottleneck for video understanding in VLMs.
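The paper names its training recipe only as "RL-based post-training algorithms." As a point of reference, the sketch below assumes a GRPO-style setup with a binary verifiable reward and group-relative advantages; the reference list includes DeepSeekMath [44], which introduced GRPO, so this is a plausible but unconfirmed reading, and every identifier here is an illustrative assumption rather than the paper's training loop.

```python
# Hedged sketch of a GRPO-style update signal for verifiable video-QA rewards.
# Assumption: a binary answer-match reward and group-relative advantages, in the
# spirit of DeepSeekMath-style GRPO [44]; not a description of the paper's recipe.
import statistics

def answer_reward(predicted: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted answer matches the gold label."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # no learning signal when all rollouts agree
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one visually grounded question, one of them correct.
rewards = [answer_reward(p, "B") for p in ["A", "B", "D", "A"]]
print(group_relative_advantages(rewards))  # the correct rollout receives a positive advantage
```

On data filtered for visual grounding, the positive-advantage rollouts are, by construction, ones that could not have been reached from text priors alone, which is the intuition behind pairing the filter with this kind of objective.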

What carries the argument

VidGround: restricting post-training to visually grounded questions, removing text-only solvable examples so that learning must rely on visual and temporal cues.
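Figure 6 describes the curation as text-only evaluation of all ~263K Video-R1-260K samples, with circular option-permutation evaluation for MCQ and Pass@10 for open-ended questions. The sketch below is a minimal illustration of that filtering logic under those assumptions, not the authors' released pipeline: `query_text_only` is a hypothetical stand-in for any VLM queried without video frames, and the keep-if-unsolvable rule is the editorial reading of VidGround.

```python
# Minimal sketch of a text-only answerability filter in the spirit of VidGround.
# All function names are hypothetical; thresholds and k are illustrative.
import random

def query_text_only(question, options=None):
    """Hypothetical stand-in for a VLM given the question text but no video."""
    if options:
        return random.choice(options)  # replace with a real model call
    return ""                          # open-ended: replace with a sampled free-form answer

def mcq_text_only_solvable(question, options, answer):
    """Circular evaluation: count the MCQ as text-only solvable only if the model
    recovers the correct option under every cyclic permutation of the options."""
    for shift in range(len(options)):
        permuted = options[shift:] + options[:shift]
        if query_text_only(question, permuted) != answer:
            return False
    return True

def open_ended_text_only_solvable(question, answer, k=10):
    """Pass@k: solvable if any of k independent text-only samples matches the reference."""
    return any(query_text_only(question) == answer for _ in range(k))

def filter_visually_grounded(dataset):
    """Keep only the examples the text-only model fails on: the visually grounded subset."""
    kept = []
    for ex in dataset:
        if ex.get("options"):
            solvable = mcq_text_only_solvable(ex["question"], ex["options"], ex["answer"])
        else:
            solvable = open_ended_text_only_solvable(ex["question"], ex["answer"])
        if not solvable:
            kept.append(ex)
    return kept
```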

If this is right

  • Video understanding in VLMs improves without needing more data or compute resources.
  • Simple data filtering proves more effective than elaborate post-training algorithms.
  • Benchmarks must be redesigned to exclude questions solvable by text alone.
  • Post-training becomes more efficient by discarding non-visually grounded examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar filtering could be applied during pre-training to reduce shortcut learning across multimodal tasks.
  • Current VLM video understanding scores may be inflated by benchmarks that allow text-only solutions.
  • Automated detection of visual grounding requirements could become standard in dataset preparation pipelines.

Load-bearing premise

The method for identifying visually grounded questions accurately isolates cases that require visual input without introducing selection bias or removing useful training signal.
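A minimal sketch of how that premise could be audited, assuming a small hand-labeled validation sample exists: score the automatic "text-only solvable" labels by precision and recall against human judgments. The field names and the choice of metrics are editorial assumptions, not the paper's protocol.

```python
# Hedged sketch: agreement between automatic and human "text-only solvable" labels.
# Question ids and label dictionaries are hypothetical.
def label_agreement(auto_labels, human_labels):
    """Precision and recall of the automatic filter, treating 'text-only solvable' as positive."""
    ids = auto_labels.keys() & human_labels.keys()
    tp = sum(auto_labels[i] and human_labels[i] for i in ids)
    fp = sum(auto_labels[i] and not human_labels[i] for i in ids)
    fn = sum(not auto_labels[i] and human_labels[i] for i in ids)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "n": len(ids)}

# Example with hypothetical question ids.
auto = {"q1": True, "q2": False, "q3": True}
human = {"q1": True, "q2": True, "q3": False}
print(label_agreement(auto, human))  # precision 0.5, recall 0.5 over 3 items
```

Low precision would mean the filter discards useful training signal; low recall would mean text-only shortcuts survive into the "grounded" subset.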

What would settle it

If a random subset of the same size as the visually grounded portion produces comparable performance gains, the claim that improvements stem specifically from visual grounding would be falsified.
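A minimal sketch of that control, assuming the data splits are plain lists of examples: draw a random subset of the full post-training set matched in size to the VidGround subset, then post-train it with the identical recipe. The seed and field layout are illustrative.

```python
# Hedged sketch of the size-matched random control the falsification test calls for.
import random

def size_matched_random_control(full_dataset, grounded_subset, seed=0):
    """Random control: same number of training examples as the grounded subset,
    drawn uniformly from the full unfiltered dataset."""
    rng = random.Random(seed)
    return rng.sample(full_dataset, k=len(grounded_subset))

# If post-training on this control matches the VidGround gains, the attribution
# to visual grounding is falsified; if it falls short, the attribution is supported.
```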

Figures

Figures reproduced from arXiv: 2604.05117 by Dongfu Jiang, EunJeong Hwang, Huaisong Zhang, Kelsey R. Allen, Penghui Du, Peter West, Ping Nie, Shenhui Zhang, Xuan He, Yiming Jia, Yuxuan Zhang.

Figure 1: Performance decomposition on three video understanding benchmarks for …
Figure 2: Analysis of text-only answerable (TA) questions in video understanding …
Figure 3: Accuracy on visually grounded (VG) questions—questions that cannot be …
Figure 4: Qualitative comparison of reasoning paths on a VideoMMMU art analysis …
Figure 5: Multi-model agreement analysis on Video-R1-260K …
Figure 6: Multi-model data curation pipelines for the VidGround-M1 and VidGround-M2 variants (GPT: GPT-5-mini [37], Qwen: Qwen2.5-VL-7B [7], Gemini: Gemini-3.1-Pro [23]). Both pipelines evaluate all ∼263K samples from Video-R1-260K [20] in text-only mode. GPT uses single-pass evaluation, Qwen uses circular evaluation with option permutation for MCQ plus Pass@10 for open-ended questions, and Gemini uses 3-permutation circular evaluation …
Figure 7: Additional examples of text-only answerable (TA) questions from …
Figure 8: Pass@10 distribution for video questions in Video-R1-260K [20]. Pass@10 measures the fraction of questions where the model produces at least one correct answer across 10 independent text-only samples. Qwen2.5-VL-7B [7] achieves Pass@10 > 0 on 74.5% of video questions without video input.
Figure 10: Overall Pass@10 distribution for Video-R1-260K …
Figure 11: Default prompt template. Red text indicates the instructions added for text-only evaluation. Enhanced prompt template for text-only evaluation: "Which tool is not necessary to make a snow globe? A. Distilled water B. Scissors C. Glitter D. Super glue. Based on the question and options provided, use your knowledge and common sense to determine the most likely correct answer. The video and subtitles are not a…"
Figure 13: Art elements analysis question comparing …
Figure 14: Psychology question. Same comparison setup as Figure …
Figure 15: Medical science question. Same comparison setup as Figure …
Figure 16: Clinical medical imaging interpretation question. Same comparison setup …
Figure 17: Public health question. Same comparison setup as Figure …
Figure 18: Diagnostics and laboratory medicine question. Same comparison setup …
Figure 19: Chemistry question. Same comparison setup as Figure …
Figure 20: Structural engineering question. Same comparison setup as Figure …
original abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper reports that 40-60% of questions in common long-video understanding benchmarks and post-training datasets are solvable from text alone. It introduces VidGround, a filter that retains only the visually grounded subset (69.1% of the original post-training data) and shows that, when paired with standard RL post-training, this subset yields up to +6.2 points over training on the full unfiltered set. The work further claims that this simple curation step outperforms several more elaborate post-training algorithms, positioning data quality as the primary bottleneck for video VLM performance.

Significance. If the attribution to visual grounding is substantiated, the result would be significant: it supplies concrete evidence that a large fraction of existing video benchmarks and training corpora do not actually require multimodal reasoning, and demonstrates that inexpensive filtering can produce larger gains than current algorithmic post-training innovations. This would redirect attention toward systematic dataset auditing and curation rather than solely toward new RL objectives or architectures.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (filter construction): the paper states that 40-60% of questions are text-only solvable and that the VidGround filter removes precisely the non-grounded remainder, yet provides no description of the procedure used to label questions as text-only (model, prompt template, temperature, decision threshold, or human validation set). Without these details the claimed 69.1% retention rate and the subsequent performance numbers cannot be reproduced or stress-tested.
  2. [§4 / Table 2] §4 (main results) and Table 2: the headline +6.2 point gain is reported only for the VidGround subset versus the full dataset. No control experiment is shown for a random 69.1% subset, a length-matched subset, or a difficulty-matched subset. Consequently the observed lift cannot yet be attributed to removal of text-only questions rather than to incidental effects of reduced volume or altered answer distribution.
  3. [§4.2] §4.2 (RL post-training details): the manuscript asserts that “data curation with a simple post-training algorithm outperforms several more complex post-training techniques,” but does not list the exact RL algorithms, reward models, hyper-parameters, or training budgets used in the comparison. This omission prevents verification that the reported superiority is driven by the data filter rather than by differences in optimization setup.
minor comments (1)
  1. [Abstract] The project page is referenced but no supplementary material, code, or filtered dataset splits are linked in the main text, which would materially aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important opportunities to improve the clarity, reproducibility, and evidential strength of our work. We respond to each major comment below and will incorporate revisions to address the concerns raised.

point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (filter construction): the paper states that 40-60% of questions are text-only solvable and that the VidGround filter removes precisely the non-grounded remainder, yet provides no description of the procedure used to label questions as text-only (model, prompt template, temperature, decision threshold, or human validation set). Without these details the claimed 69.1% retention rate and the subsequent performance numbers cannot be reproduced or stress-tested.

    Authors: We agree that the absence of these implementation details limits reproducibility. In the revised manuscript we will expand §3 with a full specification of the text-only solvability labeling procedure, including the exact text-only model and version used, the complete prompt template, the temperature setting, the decision threshold or matching criterion for classifying a question as text-only solvable, and the protocol and results of any human validation performed on the automatic labels. These additions will allow independent reproduction of the 69.1% retention rate and the VidGround filter. revision: yes

  2. Referee: [§4 / Table 2] §4 (main results) and Table 2: the headline +6.2 point gain is reported only for the VidGround subset versus the full dataset. No control experiment is shown for a random 69.1% subset, a length-matched subset, or a difficulty-matched subset. Consequently the observed lift cannot yet be attributed to removal of text-only questions rather than to incidental effects of reduced volume or altered answer distribution.

    Authors: We acknowledge that the current results lack these controls and that they are necessary to isolate the contribution of visual grounding. In the revised version we will add an experiment using a randomly sampled 69.1% subset of the original post-training data to demonstrate that simply reducing data volume does not produce comparable gains. We will also include or discuss length-matched and difficulty-matched controls to the extent they can be constructed, along with analysis of any shifts in answer distribution, to strengthen the attribution of the observed improvements to the removal of non-grounded questions. revision: yes

  3. Referee: [§4.2] §4.2 (RL post-training details): the manuscript asserts that “data curation with a simple post-training algorithm outperforms several more complex post-training techniques,” but does not list the exact RL algorithms, reward models, hyper-parameters, or training budgets used in the comparison. This omission prevents verification that the reported superiority is driven by the data filter rather than by differences in optimization setup.

    Authors: We thank the referee for noting this omission. Although the comparisons were performed under a consistent post-training framework, the specific configurations were not enumerated. We will revise §4.2 to include a detailed description or table specifying the RL algorithms, reward models, all hyper-parameters, and training budgets (steps, batch sizes, learning rates, etc.) used both for the VidGround-filtered data and for the more complex post-training baselines. This will confirm that performance differences are attributable to the data curation rather than to variations in the optimization setup. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical data filtering and RL evaluation

full rationale

The paper is an empirical study that identifies text-only answerable questions in benchmarks and post-training datasets via measurement, filters to a visually grounded subset, and applies standard RL post-training algorithms to report measured performance gains. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claims rest on external dataset measurements and experimental outcomes rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and relies on standard assumptions in machine learning evaluation rather than new theoretical constructs.

axioms (1)
  • domain assumption Benchmark performance scores reflect genuine improvements in visual and temporal understanding when text-only shortcuts are removed.
    The central claims rest on interpreting higher benchmark scores after filtering as evidence of better video understanding.

pith-pipeline@v0.9.0 · 5579 in / 1289 out tokens · 69635 ms · 2026-05-10T19:56:05.974352+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05 · unverdicted · novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Reference graph

Works this paper leans on

114 extracted references · 25 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., Bao, H., et al.: gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29612–29622 (2025)

  4. [4]

    Anthropic: Introducing Claude Sonnet 4.5 (2025), https://www.anthropic.com/news/claude-sonnet-4-5

  5. [5]

    Anthropic: Introducing Claude Opus 4.6 (2026), https://www.anthropic.com/news/claude-opus-4-6

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bahl, S., Mendonca, R., Chen, L., Jain, U., Pathak, D.: Affordances from human videos as a versatile representation for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13778–13790 (2023)

  7. [7]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

  8. [8]

    arXiv preprint arXiv:2402.17510 (2024)

    Bleeker, M., Hendriksen, M., Yates, A., de Rijke, M.: Demonstrating and reducing shortcuts in vision-language representation learning. arXiv preprint arXiv:2402.17510 (2024)

  9. [9]

    Advances in Neural Information Processing Systems 32 (2019)

    Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: RUBi: Reducing unimodal biases for visual question answering. Advances in Neural Information Processing Systems 32 (2019)

  10. [10]

    Advances in Neural Information Processing Systems 37, 113436–113460 (2024)

    Campbell, D., Rane, S., Giallanza, T., De Sabbata, C.N., Ghods, K., Joshi, A., Ku, A., Frankland, S., Griffiths, T., Cohen, J.D., et al.: Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems 37, 113436–113460 (2024)

  11. [11]

    In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Chandhok, S., Fan, W.C., Shwartz, V., Balasubramanian, V.N., Sigal, L.: Response wide shut? surprising observations in basic vision language model capabilities. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 25530–25545 (2025)

  12. [12]

    arXiv preprint arXiv:2504.11468 (2025)

    Chen, H., Tu, H., Wang, F., Liu, H., Tang, X., Du, X., Zhou, Y., Xie, C.: Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468 (2025)

  13. [13]

    Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

    Chen, H., Razin, N., Narasimhan, K., Chen, D.: Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874 (2025)

  14. [14]

    Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

    Chen, Y., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al.: Scaling rl to long videos. arXiv preprint arXiv:2507.07966 (2025)

  15. [15]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)

  16. [16]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  17. [17]

    Reinforcing video reasoning with focused thinking

    Dang, J., Wu, J., Wang, T., Lin, X., Zhu, N., Chen, H., Zheng, W.S., Wang, M., Chua, T.S.: Reinforcing video reasoning with focused thinking. arXiv preprint arXiv:2505.24718 (2025)

  18. [18]

    Electronics 14(7), 1282 (2025)

    Elhenawy, M., Ashqar, H.I., Rakotonirainy, A., Alhadidi, T.I., Jaber, A., Tami, M.A.: Vision-language models for autonomous driving: CLIP-based dynamic scene understanding. Electronics 14(7), 1282 (2025)

  19. [19]

    Advances in Neural Information Processing Systems 37, 89098–89124 (2024)

    Fang, X., Mao, K., Duan, H., Zhao, X., Li, Y., Lin, D., Chen, K.: MMBench-Video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems 37, 89098–89124 (2024)

  20. [20]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)

  21. [21]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108–24118 (2025)

  22. [22]

    Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008,

    Fu, S., Bonnen, T., Guillory, D., Darrell, T.: Hidden in plain sight: Vlms overlook their visual representations. arXiv preprint arXiv:2506.08008 (2025)

  23. [23]

    Google: Gemini 3.1 Pro: Best for complex tasks and bringing creative concepts to life (2026), https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

  24. [24]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)

  25. [25]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., Liu, Z.: Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826 (2025)

  26. [26]

    arXiv preprint arXiv:2502.11492 (2025)

    Huang, K.H., Qin, C., Qiu, H., Laban, P., Joty, S., Xiong, C., Wu, C.S.: Why vision language models struggle with visual arithmetic? towards enhanced chart and geometry understanding. arXiv preprint arXiv:2502.11492 (2025)

  27. [27]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13700–13710 (2024)

  29. [29]

    In: Findings of the Association for Computational Linguistics: NAACL 2025

    Lee, K.i., Kim, M., Yoon, S., Kim, M., Lee, D., Koh, H., Jung, K.: Vlind-bench: Measuring language priors in large vision-language models. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4129–4144 (2025)

  30. [30]

    In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval

    Liang, Z., Hu, H., Zhu, J.: LPF: A language-prior feedback objective function for debiased visual question answering. In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. pp. 1955–1959 (2021)

  31. [31]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984. Association for Computational Linguistics, Miami, Florida (2024)

  32. [32]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984 (2024)

  33. [33]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  34. [34]

    arXiv preprint arXiv:2501.00569 (2024)

    Luo, T., Cao, A., Lee, G., Johnson, J., Lee, H.: Probing visual language priors in vlms. arXiv preprint arXiv:2501.00569 (2024)

  35. [35]

    In: European Conference on Computer Vision

    Miller, D., Sünderhauf, N., Kenna, A., Mason, K.: Open-set recognition in the age of vision-language models. In: European Conference on Computer Vision. pp. 1–18. Springer (2024)

  36. [36]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual vqa: A cause-effect look at language bias. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12700–12710 (2021)

  37. [37]

    OpenAI: GPT-5. https://openai.com (2025)

  38. [38]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., Kong, S.: The neglected tails in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12988–12997 (2024)

  39. [39]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Park, J., Jang, K.J., Alasaly, B., Mopidevi, S., Zolensky, A., Eaton, E., Lee, I., Johnson, K.: Assessing modality bias in video question answering benchmarks with multimodal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 19821–19829 (2025)

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Peng, W., Xie, S., You, Z., Lan, S., Wu, Z.: Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13279–13288 (2024)

  41. [41]

    In: Proceedings of the Asian Conference on Computer Vision

    Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision language models are blind. In: Proceedings of the Asian Conference on Computer Vision. pp. 18–34 (2024)

  42. [42]

    Advances in Neural Information Processing Systems 31 (2018)

    Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. Advances in Neural Information Processing Systems 31 (2018)

  43. [43]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ranasinghe, K., Shukla, S.N., Poursaeed, O., Ryoo, M.S., Lin, T.Y.: Learning to localize objects improves spatial reasoning in visual-LLMs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12977–12987 (2024)

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  45. [45]

    IEEE Transactions on Circuits and Systems for Video Technology (2025)

    Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., Xu, C.: Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology pp. 1–1 (2025). https://doi.org/10.1109/...

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9568–9578 (2024)

  47. [47]

    Vision Language Models are Biased

    Vo, A., Nguyen, K.N., Taesiri, M.R., Dang, V.T., Nguyen, A.T., Kim, D.: Vision language models are biased. arXiv preprint arXiv:2505.23941 (2025)

  48. [48]

    Llava-critic-r1: Your critic model is secretly a strong policy model.CoRR, abs/2509.00676, 2025

    Wang, X., Li, C., Yang, J., Zhang, K., Liu, B., Xiong, T., Huang, F.: Llava-critic-r1: Your critic model is secretly a strong policy model. arXiv preprint arXiv:2509.00676 (2025)

  49. [49]

    In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Wang, Z., Yoon, J., Yu, S., Islam, M.M., Bertasius, G., Bansal, M.: Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 28114–28128 (2025)

  50. [50]

    von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., Gallouédec, Q.: TRL: Transformer Reinforcement Learning, https://github.com/huggingface/trl

  51. [51]

    When language overrules: Modality imbalance in vlms.arXiv preprint arXiv:2508.10552, 2025

    Wu, H., Tang, M., Zheng, X., Jiang, H.: When language overrules: Revealing text dominance in multimodal large language models. arXiv preprint arXiv:2508.10552 (2025)

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Xu, Y., Zhu, L., Yang, Y.: MC-Bench: A benchmark for multi-context visual grounding in the era of MLLMs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17675–17687 (2025)

  53. [53]

    In: The Eleventh International Conference on Learning Representations (2023)

    Xue, H., Sun, Y., Liu, B., Fu, J., Song, R., Li, H., Luo, J.: CLIP-ViP: Adapting pre-trained image-text model to video-language alignment. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=GNjzMAgawq

  54. [54]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

  55. [55]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Zhang, G., Zhang, Y., Zhang, K., Tresp, V.: Can vision-language models be a good guesser? exploring vlms for times and location reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 636–645 (2024)

  56. [56]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

  57. [57]

    Zhang, Y., Unell, A., Wang, X., Ghosh, D., Su, Y., Schmidt, L., Yeung-Levy, S.: Why are visually-grounded language models bad at image classification? Advances in Neural Information Processing Systems 37, 51727–51753 (2024)

  58. [58]

    MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

    Zhao, Y., Zhang, H., Xie, L., Hu, T., Gan, G., Long, Y., Hu, Z., Chen, W., Li, C., Xu, Z., et al.: MMVU: Measuring expert-level multi-discipline video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8475–8489 (2025)
