Pith · machine review for the scientific record

arXiv: 2605.02378 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:15 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: multimodal in-context learning · inductive reasoning · vision-language models · chain-of-thought · visual token compression · attention rebalancing · reinforcement learning

The pith

A framework turns fragile multimodal in-context learning into explicit inductive-deductive rule extraction for vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that vision-language models often reach correct answers through flawed or inconsistent reasoning when given few-shot examples, rather than extracting reliable rules from the demonstrations. This inductive gap grows worse because redundant visual patches flood the context and attention skews toward the first image. The authors restructure the process around three targeted fixes: compressing similar visual tokens to reduce noise, rebalancing attention so every image receives fair focus, and imposing a chain-of-thought sequence that first examines each example, derives a general rule, and then applies that rule to the query. An added training loop of supervised fine-tuning plus reinforcement learning rewards faithful rule citation and noise rejection. If these changes succeed, multimodal models should behave more like genuine inductive reasoners across perception, logic, STEM, and sarcasm tasks.
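To make the first fix concrete, here is a minimal sketch of similarity-based visual token compression. The greedy cosine-similarity merge and the 0.9 threshold are illustrative assumptions, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(patch_embeds: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge redundant patch tokens from one image.

    patch_embeds: (num_patches, dim). A token whose cosine similarity to an
    already-kept token exceeds `sim_threshold` is folded into that token's
    running mean instead of being kept separately. Illustrative stand-in for
    the paper's similarity-guided compression, not its exact algorithm.
    """
    normed = F.normalize(patch_embeds, dim=-1)
    kept, counts = [], []
    for i in range(patch_embeds.size(0)):
        merged = False
        for j, anchor in enumerate(kept):
            if torch.dot(normed[i], F.normalize(anchor, dim=-1)) > sim_threshold:
                counts[j] += 1
                kept[j] = anchor + (patch_embeds[i] - anchor) / counts[j]  # running mean of the group
                merged = True
                break
        if not merged:
            kept.append(patch_embeds[i].clone())
            counts.append(1)
    return torch.stack(kept)

# Example: 196 ViT patch embeddings compressed to a shorter sequence.
print(compress_visual_tokens(torch.randn(196, 1024)).shape)
```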

Core claim

The central claim is that multimodal ICL fails primarily due to an inductive gap in which models do not extract consistent rules across visual demonstrations, compounded by redundant visual tokens and imbalanced attention; this can be remedied by recasting ICL as an explicit inductive-deductive pipeline that compresses visual tokens by similarity, rebalances attention dynamically, and forces the model to analyze examples, induce a rule, and deductively apply it, reinforced by auxiliary reinforcement learning that rewards verifiable rule use.

What carries the argument

The inductive-deductive chain-of-thought process, which separates per-example analysis, general-rule derivation, and query application, supported by similarity-based visual token compression and dynamic attention rebalancing.
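The separation described above can be read as a three-stage prompt. The tags and wording below are illustrative, loosely following the analyze / induce / apply stages the paper describes; they are not the paper's exact template, and plain text stands in for interleaved image tokens.

```python
from dataclasses import dataclass

@dataclass
class Demo:
    image_ref: str   # hypothetical handle for the demonstration image
    question: str
    answer: str

def build_inductive_deductive_prompt(demos: list[Demo], query_image_ref: str, query_question: str) -> str:
    """Assemble an analyze -> induce -> apply chain-of-thought prompt (illustrative)."""
    parts = ["Solve the new question by first learning a rule from the demonstrations."]
    for i, d in enumerate(demos, 1):
        parts.append(
            f"Example {i}: <image:{d.image_ref}> Q: {d.question} A: {d.answer}\n"
            "Step 1 (analyze): describe what this example shows and the logic behind its answer."
        )
    parts.append("Step 2 (induce): state one general rule that explains every helpful example above.")
    parts.append(
        "Step 3 (apply): use only that rule to answer the new question.\n"
        f"New question: <image:{query_image_ref}> Q: {query_question}"
    )
    return "\n\n".join(parts)

demos = [Demo("img_1", "What comes next in the pattern?", "circle"),
         Demo("img_2", "What comes next in the pattern?", "square")]
print(build_inductive_deductive_prompt(demos, "img_q", "What comes next in the pattern?"))
```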

If this is right

  • Vision-language models will produce correct answers grounded in extracted rules rather than flawed reasoning across visual perception, logical reasoning, STEM, and sarcasm tasks.
  • Reducing redundant visual tokens and rebalancing attention will allow later demonstrations to influence reasoning as much as the first image (one way to check this directly from attention maps is sketched after this list).
  • The auxiliary reinforcement-learning stage will increase the frequency of faithful citation of the induced rule and filtering of irrelevant visual noise.
  • The same framework will deliver consistent gains when applied to multiple open-source vision-language models.
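Following up the second bullet above, the attention claim is directly measurable. The sketch below assumes Hugging Face-style attention tensors (e.g., obtained with `output_attentions=True`) and known token spans for each image; the proportional rescaling is one simple way to rebalance, not necessarily the paper's mechanism.

```python
import torch

def per_image_attention_mass(attn: torch.Tensor, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Attention the final query token pays to each image's token span.

    attn: (heads, seq_len, seq_len) weights from one layer.
    image_spans: [(start, end), ...] token index ranges for each image.
    """
    last_query = attn[:, -1, :]                      # (heads, seq_len)
    return torch.stack([last_query[:, s:e].sum() for s, e in image_spans])

def rebalance(attn: torch.Tensor, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Scale each image span so all images get roughly equal mass, then renormalize rows."""
    out = attn.clone()
    masses = per_image_attention_mass(attn, image_spans)
    target = masses.mean()
    for (s, e), m in zip(image_spans, masses):
        out[:, :, s:e] *= target / m.clamp(min=1e-8)
    return out / out.sum(dim=-1, keepdim=True)

# Toy check: two images at token ranges 0-9 and 10-19 of a 30-token sequence.
attn = torch.softmax(torch.randn(8, 30, 30), dim=-1)
print(per_image_attention_mass(attn, [(0, 10), (10, 20)]))
print(per_image_attention_mass(rebalance(attn, [(0, 10), (10, 20)]), [(0, 10), (10, 20)]))
```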

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same inductive-deductive separation could be tested in pure text in-context learning to see whether explicit rule derivation improves consistency there as well.
  • The token-compression step might reduce context length enough to allow longer demonstration sets without exceeding model limits.
  • If the rebalancing mechanism proves general, similar attention adjustments could be applied to other multi-image or multi-turn vision tasks outside of in-context learning.

Load-bearing premise

That the performance gains come from genuine rule extraction enabled by the new modules rather than from incidental effects of the added training or prompting steps.

What would settle it

A controlled test that measures whether the model, when prompted to state the derived rule after the analysis step, produces rules that accurately capture the shared pattern in the demonstrations, and whether using those rules predicts final-answer correctness better than standard ICL.
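A minimal way to score such a test, assuming each trial records a judge's verdict on the stated rule and the final answer's correctness (both field names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    rule_is_faithful: bool   # judge: does the stated rule capture the demonstrations' shared pattern?
    answer_is_correct: bool  # did the final answer match the ground truth?

def rule_predicts_answer(trials: list[Trial]) -> dict[str, float]:
    """Compare answer accuracy conditioned on whether the induced rule was faithful."""
    faithful = [t for t in trials if t.rule_is_faithful]
    unfaithful = [t for t in trials if not t.rule_is_faithful]
    acc = lambda ts: sum(t.answer_is_correct for t in ts) / len(ts) if ts else float("nan")
    return {
        "accuracy_given_faithful_rule": acc(faithful),
        "accuracy_given_unfaithful_rule": acc(unfaithful),
        "rule_faithfulness_rate": len(faithful) / len(trials),
    }

# Toy numbers only: support for the claim would look like the first accuracy clearly exceeding the second.
trials = ([Trial(True, True)] * 40 + [Trial(True, False)] * 5 +
          [Trial(False, True)] * 10 + [Trial(False, False)] * 15)
print(rule_predicts_answer(trials))
```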

Figures

Figures reproduced from arXiv: 2605.02378 by Gang Liu, Haonan Wang, Haoyu Wang, Jiahong Yan, Jun Chen, Qian Wang, Yanghua Xiao, Yuyan Chen.

Figure 1: Examples of naive and inductive multimodal ICL. For visual perception-based QA, VLMs …
Figure 3: Attention heatmap visualization for VLMs during …
Figure 4: Overview of the proposed MMInduction framework. (1) Similarity-Guided Visual Token …
Figure 5: Training dynamics of the three fine-grained reward components …
Figure 6: Attention heatmaps of VLMs after MMInduction training, demonstrating a balanced …
Figure 7: Case study on the VQA dataset. Dataset: MMIQ | Task: Multimodal Knowledge Reasoning | Case ID: #364 …
Figure 8: Case study on the MMIQ dataset. Vanilla Gemini-3.1-pro and Qwen3-VL fail to reach the …
Figure 9: Failure case on the MDK12 dataset where all three methods (Vanilla Gemini-3.1-pro, …
Original abstract

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that multimodal ICL in VLMs is limited by an 'inductive gap' (correct answers from flawed reasoning, failure to extract consistent rules across demonstrations), worsened by redundant visual tokens and skewed attention favoring the first image. It proposes a framework restructuring ICL as inductive-deductive: similarity-based visual token compression to filter redundant patches, dynamic attention rebalancing for equitable focus, and a CoT paradigm that analyzes examples, derives a generalizable rule, then applies it to the query. An auxiliary SFT+RL pipeline with verifiable rewards reinforces faithful citation and noise filtering. Evaluations on eight benchmarks (visual perception, logical reasoning, STEM, sarcasm detection) show consistent significant gains over standard ICL for open-source VLMs.

Significance. If the gains are shown to stem specifically from the inductive-deductive CoT enabling rule extraction (rather than from token compression or attention fixes), the work would meaningfully advance multimodal ICL by offering a principled separation of induction and deduction. The verifiable-reward RL component is a constructive element for controllable training. The result would highlight a path toward more reliable generalization from visual demonstrations, with potential impact on reasoning-heavy VLM applications.

major comments (2)
  1. The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.
  2. §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary (a minimal ablation grid over the three modules is sketched after these comments).
minor comments (2)
  1. The term 'inductive gap' is introduced without a precise operational definition or contrast to standard ICL failure modes; a short formalization or illustrative example in the introduction would improve clarity.
  2. The eight benchmarks are referenced in the abstract but not enumerated with citations or task descriptions in the provided text; this should be added to §1 or §5 for reproducibility.
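The isolation asked for in the two major comments amounts to a factorial ablation over the three modules. A minimal configuration grid, with hypothetical flag names:

```python
from itertools import product

# Hypothetical module switches; the paper's own ablation may cut at a different granularity.
MODULES = ("token_compression", "attention_rebalancing", "inductive_deductive_cot")

def ablation_grid():
    """Yield every on/off combination, from the plain-ICL baseline to the full framework."""
    for flags in product([False, True], repeat=len(MODULES)):
        yield dict(zip(MODULES, flags))

for config in ablation_grid():
    enabled = [name for name, on in config.items() if on] or ["standard ICL baseline"]
    print(" + ".join(enabled))
```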

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and for recognizing the potential of our inductive-deductive framework in advancing multimodal in-context learning. We address the major comments point by point below. We agree that additional ablations and statistical analyses will strengthen the claims and will incorporate these in the revised manuscript.

Point-by-point responses
  1. Referee: The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.

    Authors: We thank the referee for highlighting this critical aspect. Our framework is designed such that the inductive-deductive CoT is the core mechanism for enabling rule extraction, with compression and rebalancing serving as supporting visual-level enhancements. However, we acknowledge that the current presentation bundles the components. In the revision, we will add dedicated ablations that apply token compression and attention rebalancing without the analyze-derive-apply CoT steps, allowing direct comparison to the full model. Furthermore, we will include direct probes of rule fidelity, such as consistency checks of derived rules on held-out demonstration sets and evaluations on counterfactual query tests. For the RL pipeline, we will augment the reporting with explicit rule-consistency metrics alongside the verifiable rewards for citation and filtering. revision: yes

  2. Referee: §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary.

    Authors: We agree that rigorous statistical validation and targeted baselines are essential for substantiating the cross-benchmark claims. The revised manuscript will include statistical significance tests (such as t-tests across multiple random seeds) and report run-to-run variance (mean ± std) for all key results. Additionally, we will introduce strong baselines that incorporate only the similarity-based visual token compression and dynamic attention rebalancing without the inductive-deductive CoT, to explicitly demonstrate the necessity of the reasoning paradigm. revision: yes
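The statistics promised in the second response are simple to report. A sketch, with placeholder per-seed accuracies (not values from the paper), pairing the full framework against a compression-plus-rebalancing-only baseline:

```python
import numpy as np
from scipy import stats

# Placeholder per-seed benchmark accuracies; real numbers would come from repeated runs.
full_framework = np.array([71.2, 70.8, 71.9, 70.5, 71.4])
compress_rebalance_only = np.array([68.9, 69.3, 68.5, 69.0, 68.7])

mean_full, std_full = full_framework.mean(), full_framework.std(ddof=1)
mean_base, std_base = compress_rebalance_only.mean(), compress_rebalance_only.std(ddof=1)

# Paired test: the same seeds and demonstration sets are used for both systems.
t_stat, p_value = stats.ttest_rel(full_framework, compress_rebalance_only)

print(f"full framework: {mean_full:.1f} ± {std_full:.1f}   baseline: {mean_base:.1f} ± {std_base:.1f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```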

Circularity Check

0 steps flagged

No circularity: empirical framework with independent evaluation

full rationale

The paper introduces an empirical framework consisting of three modules (visual token compression, attention rebalancing, and inductive-deductive CoT) plus an auxiliary SFT+RL pipeline, then reports benchmark improvements. No mathematical derivation, prediction, or first-principles result is claimed that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmark evaluations rather than tautological redefinitions or self-referential loops, so the work is testable against benchmarks it does not define.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard domain assumptions in multimodal learning and ICL; no free parameters or invented entities are specified in the abstract.

axioms (2)
  • domain assumption: Models often produce correct answers from flawed reasoning in multimodal ICL
    Stated as the fundamental limitation identified by analysis
  • domain assumption: Redundant visual tokens obscure textual cues and attention is skewed toward the initial image
    Presented as the two visual-level obstacles exacerbating the inductive gap

pith-pipeline@v0.9.0 · 5551 in / 1299 out tokens · 59748 ms · 2026-05-08T19:15:13.089180+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references · 29 canonical work pages · 12 internal anchors

  1. [1]

    Anthropic Claude

    Anthropic. Anthropic Claude. https://claude.ai/, 2026

  2. [2]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    What makes multimodal in-context learning work? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024

    Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, and Benjamin Piwowarski. What makes multimodal in-context learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024

  6. [6]

    The role of deductive and inductive reasoning in large language models

    Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, and Lei Li. The role of deductive and inductive reasoning in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16780–16790, 2025

  7. [7]

    Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models.arXiv preprint arXiv:2502.00698, 2025

    Huanqia Cai, Yijun Yang, and Winston Hu. Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models.arXiv preprint arXiv:2502.00698, 2025

  8. [8]

    Can multimodal large language models truly perform multimodal in-context learning? In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6000–6010

    Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, and Jindong Gu. Can multimodal large language models truly perform multimodal in-context learning? In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6000–6010. IEEE, 2025

  9. [9]

    True multimodal in-context learning needs attention to the visual context.arXiv preprint arXiv:2507.15807, 2025

    Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. True multimodal in-context learning needs attention to the visual context. arXiv preprint arXiv:2507.15807, 2025

  10. [10]

    Inductive or deductive? rethinking the fundamental reasoning abilities of llms.arXiv preprint arXiv:2408.00114, 2024

    Kewei Cheng, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Binxuan Huang, Ruirui Li, Shiyang Li, Zheng Li, Yifan Gao, Xian Li, et al. Inductive or deductive? rethinking the fundamental reasoning abilities of llms.arXiv preprint arXiv:2408.00114, 2024

  11. [11]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 1107–1128, 2024

  12. [12]

    Towards multimodal in-context learning for vision and language models

    Sivan Doveh, Shaked Perek, M Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. Towards multimodal in-context learning for vision and language models. InEuropean Conference on Computer Vision, pages 250–267. Springer, 2024

  13. [13]

    Semeval-2022 task 5: Multimedia automatic misogyny identification

    Elisabetta Fersini, Francesca Gasparini, Giulia Rizzi, Aurora Saibene, Berta Chulvi, Paolo Rosso, Alyssa Lees, and Jeffrey Sorensen. Semeval-2022 task 5: Multimedia automatic misogyny identification. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 533–549, 2022

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  15. [15]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  16. [16]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  17. [17]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  18. [18]

    Csvqa: A chinese multimodal benchmark for evaluating stem reasoning capabilities of vlms.arXiv preprint arXiv:2505.24120, 2025

    Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, and Xuchen Song. Csvqa: A chinese multimodal benchmark for evaluating stem reasoning capabilities of vlms.arXiv preprint arXiv:2505.24120, 2025

  19. [19]

    Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025

  20. [20]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  21. [21]

    M2IV:: Towards efficient and fine-grained multimodal in-context learning via representation engineering.arXiv preprint arXiv:2504.04633, 2025

    Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, and Ruixiang Tang. M2IV:: Towards efficient and fine-grained multimodal in-context learning via representation engineering.arXiv preprint arXiv:2504.04633, 2025

  22. [22]

    Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning

    Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, and Ruixiang Tang. Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6619–6627, 2026

  23. [23]

    Make lvlms focus: Context-aware attention modulation for better multimodal in-context learning

    Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, et al. Make lvlms focus: Context-aware attention modulation for better multimodal in-context learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6610–6618, 2026

  24. [24]

    Implicit in-context learning.arXiv preprint arXiv:2405.14660, 2024

    Zhuowei Li, Zihao Xu, Ligong Han, Yunhe Gao, Song Wen, Di Liu, Hao Wang, and Dimitris N Metaxas. Implicit in-context learning.arXiv preprint arXiv:2405.14660, 2024

  25. [25]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  26. [26]

    Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection

    Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 103...

  27. [27]

    Rethinking the role of demonstrations: What makes in-context learning work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

  28. [28]

    A new era of intelligence with gemini

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini

  29. [29]

    https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025

  30. [30]

    Detecting harmful memes and their targets

    Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Md Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty. Detecting harmful memes and their targets. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 2783–2796, 2021

  31. [31]

    Seed1.8 model card: Towards generalized real-world agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. arXiv preprint arXiv:2603.20633, 2026

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  35. [35]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  36. [36]

    Meta-analysis of cohen’s kappa.Health Services and Outcomes Research Methodology, 11(3):145–163, 2011

    Shuyan Sun. Meta-analysis of cohen’s kappa.Health Services and Outcomes Research Methodology, 11(3):145–163, 2011

  37. [37]

    Kimi K2.5: Visual agentic intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  38. [38]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, 2025

  39. [39]

    Identifying and mitigating position bias of multi-image vision-language models

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Identifying and mitigating position bias of multi-image vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10599–10609, 2025

  40. [40]

    Mmifevol: Towards evolutionary multimodal instruction following

    Haoyu Wang, Sihang Jiang, Xiangru Zhu, Yuyan Chen, Xiaojun Meng, Jiansheng Wei, Yitong Wang, and Yanghua Xiao. Mmifevol: Towards evolutionary multimodal instruction following. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26206–26214, 2026

  41. [41]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  42. [42]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  44. [44]

    The learnability of in-context learning

    Noam Wies, Yoav Levine, and Amnon Shashua. The learnability of in-context learning. Advances in Neural Information Processing Systems, 36:36637–36651, 2023

  45. [45]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670, 2025

  46. [46]

    Improve multi-modal embedding learning via explicit hard negative gradient amplifying.arXiv preprint arXiv:2506.02020, 2025

    Youze Xue, Dian Li, and Gang Liu. Improve multi-modal embedding learning via explicit hard negative gradient amplifying.arXiv preprint arXiv:2506.02020, 2025

  47. [47]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2376–2385, 2025

  48. [48]

    Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy

    Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21744–21754, 2025

  49. [49]

    Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

    Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, and Jiaxing Huang. Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

  50. [50]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  51. [51]

    Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

  52. [52]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  53. [53]

    R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning

    Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning.arXiv preprint arXiv:2503.05379, 2025

  54. [54]

    Swift: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025

  55. [55]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  56. [56]

    Mdk12-bench: a multi-discipline benchmark for evaluating reasoning in multimodal large language models

    Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Wangbo Zhao, Jiajun Song, Chuanhao Li, Weidong Tang, et al. Mdk12-bench: a multi-discipline benchmark for evaluating reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28982–28990, 2026

  57. [57]

    Don’t just chase “highlighted tokens” in MLLMs: Revisiting visual holistic context retention

    Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, and Xuming Hu. Don’t just chase “highlighted tokens” in MLLMs: Revisiting visual holistic context retention. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
