Pith · machine review for the scientific record

arXiv: 2605.02378 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:15 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: multimodal in-context learning · inductive reasoning · vision-language models · chain-of-thought · visual token compression · attention rebalancing · reinforcement learning

The pith

A framework turns fragile multimodal in-context learning into explicit inductive-deductive rule extraction for vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that vision-language models often reach correct answers through flawed or inconsistent reasoning when given few-shot examples, rather than extracting reliable rules from the demonstrations. This inductive gap grows worse because redundant visual patches flood the context and attention skews toward the first image. The authors restructure the process around three targeted fixes: compressing similar visual tokens to reduce noise, rebalancing attention so every image receives fair focus, and imposing a chain-of-thought sequence that first examines each example, derives a general rule, and then applies that rule to the query. An added training loop of supervised fine-tuning plus reinforcement learning rewards faithful rule citation and noise rejection. If these changes succeed, multimodal models should behave more like genuine inductive reasoners across perception, logic, STEM, and sarcasm tasks.
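To make the first fix concrete, here is a minimal sketch of similarity-based visual token compression. The greedy cosine-similarity merge and the 0.9 threshold are illustrative assumptions, not the paper's actual module.

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(patch_embeds: torch.Tensor, sim_threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge redundant patch tokens from one image.

    patch_embeds: (num_patches, dim). A token whose cosine similarity to an
    already-kept token exceeds `sim_threshold` is folded into that token's
    running mean instead of being kept separately. Illustrative stand-in for
    the paper's similarity-guided compression, not its exact algorithm.
    """
    normed = F.normalize(patch_embeds, dim=-1)
    kept, counts = [], []
    for i in range(patch_embeds.size(0)):
        merged = False
        for j, anchor in enumerate(kept):
            if torch.dot(normed[i], F.normalize(anchor, dim=-1)) > sim_threshold:
                counts[j] += 1
                kept[j] = anchor + (patch_embeds[i] - anchor) / counts[j]  # running mean of the group
                merged = True
                break
        if not merged:
            kept.append(patch_embeds[i].clone())
            counts.append(1)
    return torch.stack(kept)

# Example: 196 ViT patch embeddings compressed to a shorter sequence.
print(compress_visual_tokens(torch.randn(196, 1024)).shape)
```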

Core claim

The central claim is that multimodal ICL fails primarily due to an inductive gap in which models do not extract consistent rules across visual demonstrations, compounded by redundant visual tokens and imbalanced attention; this can be remedied by recasting ICL as an explicit inductive-deductive pipeline that compresses visual tokens by similarity, rebalances attention dynamically, and forces the model to analyze examples, induce a rule, and deductively apply it, reinforced by auxiliary reinforcement learning that rewards verifiable rule use.

What carries the argument

The inductive-deductive chain-of-thought process, which separates per-example analysis, general-rule derivation, and query application, supported by similarity-based visual token compression and dynamic attention rebalancing.
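The separation described above can be read as a three-stage prompt. The tags and wording below are illustrative, loosely following the analyze / induce / apply stages the paper describes; they are not the paper's exact template, and plain text stands in for interleaved image tokens.

```python
from dataclasses import dataclass

@dataclass
class Demo:
    image_ref: str   # hypothetical handle for the demonstration image
    question: str
    answer: str

def build_inductive_deductive_prompt(demos: list[Demo], query_image_ref: str, query_question: str) -> str:
    """Assemble an analyze -> induce -> apply chain-of-thought prompt (illustrative)."""
    parts = ["Solve the new question by first learning a rule from the demonstrations."]
    for i, d in enumerate(demos, 1):
        parts.append(
            f"Example {i}: <image:{d.image_ref}> Q: {d.question} A: {d.answer}\n"
            "Step 1 (analyze): describe what this example shows and the logic behind its answer."
        )
    parts.append("Step 2 (induce): state one general rule that explains every helpful example above.")
    parts.append(
        "Step 3 (apply): use only that rule to answer the new question.\n"
        f"New question: <image:{query_image_ref}> Q: {query_question}"
    )
    return "\n\n".join(parts)

demos = [Demo("img_1", "What comes next in the pattern?", "circle"),
         Demo("img_2", "What comes next in the pattern?", "square")]
print(build_inductive_deductive_prompt(demos, "img_q", "What comes next in the pattern?"))
```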

If this is right

  • Vision-language models will produce correct answers grounded in extracted rules rather than flawed reasoning across visual perception, logical reasoning, STEM, and sarcasm tasks.
  • Reducing redundant visual tokens and rebalancing attention will allow later demonstrations to influence reasoning as much as the first image (one way to check this directly from attention maps is sketched after this list).
  • The auxiliary reinforcement-learning stage will increase the frequency of faithful citation of the induced rule and filtering of irrelevant visual noise.
  • The same framework will deliver consistent gains when applied to multiple open-source vision-language models.
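Following up the second bullet above, the attention claim is directly measurable. The sketch below assumes Hugging Face-style attention tensors (e.g., obtained with `output_attentions=True`) and known token spans for each image; the proportional rescaling is one simple way to rebalance, not necessarily the paper's mechanism.

```python
import torch

def per_image_attention_mass(attn: torch.Tensor, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Attention the final query token pays to each image's token span.

    attn: (heads, seq_len, seq_len) weights from one layer.
    image_spans: [(start, end), ...] token index ranges for each image.
    """
    last_query = attn[:, -1, :]                      # (heads, seq_len)
    return torch.stack([last_query[:, s:e].sum() for s, e in image_spans])

def rebalance(attn: torch.Tensor, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Scale each image span so all images get roughly equal mass, then renormalize rows."""
    out = attn.clone()
    masses = per_image_attention_mass(attn, image_spans)
    target = masses.mean()
    for (s, e), m in zip(image_spans, masses):
        out[:, :, s:e] *= target / m.clamp(min=1e-8)
    return out / out.sum(dim=-1, keepdim=True)

# Toy check: two images at token ranges 0-9 and 10-19 of a 30-token sequence.
attn = torch.softmax(torch.randn(8, 30, 30), dim=-1)
print(per_image_attention_mass(attn, [(0, 10), (10, 20)]))
print(per_image_attention_mass(rebalance(attn, [(0, 10), (10, 20)]), [(0, 10), (10, 20)]))
```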

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same inductive-deductive separation could be tested in pure text in-context learning to see whether explicit rule derivation improves consistency there as well.
  • The token-compression step might reduce context length enough to allow longer demonstration sets without exceeding model limits.
  • If the rebalancing mechanism proves general, similar attention adjustments could be applied to other multi-image or multi-turn vision tasks outside of in-context learning.

Load-bearing premise

That the performance gains come from genuine rule extraction enabled by the new modules rather than from incidental effects of the added training or prompting steps.

What would settle it

A controlled test that measures whether the model, when prompted to state the derived rule after the analysis step, produces rules that accurately capture the shared pattern in the demonstrations, and whether using those rules predicts final-answer correctness better than standard ICL.
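A minimal way to score such a test, assuming each trial records a judge's verdict on the stated rule and the final answer's correctness (both field names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Trial:
    rule_is_faithful: bool   # judge: does the stated rule capture the demonstrations' shared pattern?
    answer_is_correct: bool  # did the final answer match the ground truth?

def rule_predicts_answer(trials: list[Trial]) -> dict[str, float]:
    """Compare answer accuracy conditioned on whether the induced rule was faithful."""
    faithful = [t for t in trials if t.rule_is_faithful]
    unfaithful = [t for t in trials if not t.rule_is_faithful]
    acc = lambda ts: sum(t.answer_is_correct for t in ts) / len(ts) if ts else float("nan")
    return {
        "accuracy_given_faithful_rule": acc(faithful),
        "accuracy_given_unfaithful_rule": acc(unfaithful),
        "rule_faithfulness_rate": len(faithful) / len(trials),
    }

# Toy numbers only: support for the claim would look like the first accuracy clearly exceeding the second.
trials = ([Trial(True, True)] * 40 + [Trial(True, False)] * 5 +
          [Trial(False, True)] * 10 + [Trial(False, False)] * 15)
print(rule_predicts_answer(trials))
```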

Figures

Figures reproduced from arXiv: 2605.02378 by Gang Liu, Haonan Wang, Haoyu Wang, Jiahong Yan, Jun Chen, Qian Wang, Yanghua Xiao, Yuyan Chen.

Figure 1: Examples of naive and inductive multimodal ICL. For visual perception-based QA, VLMs …
Figure 3: Attention heatmap visualization for VLMs during …
Figure 4: Overview of the proposed MMInduction framework. (1) Similarity-Guided Visual Token …
Figure 5: Training dynamics of the three fine-grained reward components …
Figure 6: Attention heatmaps of VLMs after MMInduction training, demonstrating a balanced …
Figure 7: Case study on the VQA dataset. Dataset: MMIQ | Task: Multimodal Knowledge Reasoning | Case ID: #364 …
Figure 8: Case study on the MMIQ dataset. Vanilla Gemini-3.1-pro and Qwen3-VL fail to reach the …
Figure 9: Failure case on the MDK12 dataset where all three methods (Vanilla Gemini-3.1-pro, …
Original abstract

In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to distribute focus equitably across all images, and a chain-of-thought paradigm that explicitly guides the model to analyze individual examples, derive a generalizable rule, and then apply it to the query. An auxiliary learning pipeline combines supervised fine-tuning with reinforcement learning using verifiable rewards to reinforce faithful citation and noise filtering. Evaluations across eight benchmarks covering visual perception, logical reasoning, STEM problems, and sarcasm detection demonstrate consistent and significant improvements over standard ICL baselines for multiple open-source VLMs, highlighting the potential of equipping models with genuine inductive capabilities in multimodal settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that multimodal ICL in VLMs is limited by an 'inductive gap' (correct answers from flawed reasoning, failure to extract consistent rules across demonstrations), worsened by redundant visual tokens and skewed attention favoring the first image. It proposes a framework restructuring ICL as inductive-deductive: similarity-based visual token compression to filter redundant patches, dynamic attention rebalancing for equitable focus, and a CoT paradigm that analyzes examples, derives a generalizable rule, then applies it to the query. An auxiliary SFT+RL pipeline with verifiable rewards reinforces faithful citation and noise filtering. Evaluations on eight benchmarks (visual perception, logical reasoning, STEM, sarcasm detection) show consistent significant gains over standard ICL for open-source VLMs.

Significance. If the gains are shown to stem specifically from the inductive-deductive CoT enabling rule extraction (rather than from token compression or attention fixes), the work would meaningfully advance multimodal ICL by offering a principled separation of induction and deduction. The verifiable-reward RL component is a constructive element for controllable training. The result would highlight a path toward more reliable generalization from visual demonstrations, with potential impact on reasoning-heavy VLM applications.

major comments (2)
  1. The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.
  2. §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary (a minimal ablation grid over the three modules is sketched after these comments).
minor comments (2)
  1. The term 'inductive gap' is introduced without a precise operational definition or contrast to standard ICL failure modes; a short formalization or illustrative example in the introduction would improve clarity.
  2. The eight benchmarks are referenced in the abstract but not enumerated with citations or task descriptions in the provided text; this should be added to §1 or §5 for reproducibility.
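The isolation asked for in the two major comments amounts to a factorial ablation over the three modules. A minimal configuration grid, with hypothetical flag names:

```python
from itertools import product

# Hypothetical module switches; the paper's own ablation may cut at a different granularity.
MODULES = ("token_compression", "attention_rebalancing", "inductive_deductive_cot")

def ablation_grid():
    """Yield every on/off combination, from the plain-ICL baseline to the full framework."""
    for flags in product([False, True], repeat=len(MODULES)):
        yield dict(zip(MODULES, flags))

for config in ablation_grid():
    enabled = [name for name, on in config.items() if on] or ["standard ICL baseline"]
    print(" + ".join(enabled))
```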

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and for recognizing the potential of our inductive-deductive framework in advancing multimodal in-context learning. We address the major comments point by point below. We agree that additional ablations and statistical analyses will strengthen the claims and will incorporate these in the revised manuscript.

Point-by-point responses
  1. Referee: The central claim that the framework equips models with 'genuine inductive capabilities' (abstract) is load-bearing but unsupported by isolating evidence. The three modules (token compression, attention rebalancing, inductive-deductive CoT) are bundled; without ablations that retain compression/rebalancing while removing the CoT (analyze-derive-apply) step, or direct probes of rule fidelity (e.g., consistency of derived rules across held-out demonstration sets or counterfactual query tests), benchmark gains cannot be attributed to rule extraction rather than noise reduction. The RL 'verifiable rewards' for citation and filtering are described but lack explicit rule-consistency metrics.

    Authors: We thank the referee for highlighting this critical aspect. Our framework is designed such that the inductive-deductive CoT is the core mechanism for enabling rule extraction, with compression and rebalancing serving as supporting visual-level enhancements. However, we acknowledge that the current presentation bundles the components. In the revision, we will add dedicated ablations that apply token compression and attention rebalancing without the analyze-derive-apply CoT steps, allowing direct comparison to the full model. Furthermore, we will include direct probes of rule fidelity, such as consistency checks of derived rules on held-out demonstration sets and evaluations on counterfactual query tests. For the RL pipeline, we will augment the reporting with explicit rule-consistency metrics alongside the verifiable rewards for citation and filtering. revision: yes

  2. Referee: §5 (Experiments): the abstract asserts 'consistent and significant improvements' across eight benchmarks for multiple VLMs, yet the manuscript supplies no statistical significance tests, run-to-run variance, or comparisons against strong baselines that apply only compression/rebalancing without the CoT. This weakens the cross-benchmark claim and leaves open whether the inductive-deductive component is necessary.

    Authors: We agree that rigorous statistical validation and targeted baselines are essential for substantiating the cross-benchmark claims. The revised manuscript will include statistical significance tests (such as t-tests across multiple random seeds) and report run-to-run variance (mean ± std) for all key results. Additionally, we will introduce strong baselines that incorporate only the similarity-based visual token compression and dynamic attention rebalancing without the inductive-deductive CoT, to explicitly demonstrate the necessity of the reasoning paradigm. revision: yes
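The statistics promised in the second response are simple to report. A sketch, with placeholder per-seed accuracies (not values from the paper), pairing the full framework against a compression-plus-rebalancing-only baseline:

```python
import numpy as np
from scipy import stats

# Placeholder per-seed benchmark accuracies; real numbers would come from repeated runs.
full_framework = np.array([71.2, 70.8, 71.9, 70.5, 71.4])
compress_rebalance_only = np.array([68.9, 69.3, 68.5, 69.0, 68.7])

mean_full, std_full = full_framework.mean(), full_framework.std(ddof=1)
mean_base, std_base = compress_rebalance_only.mean(), compress_rebalance_only.std(ddof=1)

# Paired test: the same seeds and demonstration sets are used for both systems.
t_stat, p_value = stats.ttest_rel(full_framework, compress_rebalance_only)

print(f"full framework: {mean_full:.1f} ± {std_full:.1f}   baseline: {mean_base:.1f} ± {std_base:.1f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```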

Circularity Check

0 steps flagged

No circularity: empirical framework with independent evaluation

full rationale

The paper introduces an empirical framework consisting of three modules (visual token compression, attention rebalancing, and inductive-deductive CoT) plus an auxiliary SFT+RL pipeline, then reports benchmark improvements. No mathematical derivation, prediction, or first-principles result is claimed that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on external benchmark evaluations rather than tautological redefinitions or self-referential loops, so the work is testable against benchmarks it does not define.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard domain assumptions in multimodal learning and ICL; no free parameters or invented entities are specified in the abstract.

axioms (2)
  • domain assumption: Models often produce correct answers from flawed reasoning in multimodal ICL
    Stated as the fundamental limitation identified by analysis
  • domain assumption: Redundant visual tokens obscure textual cues and attention is skewed toward the initial image
    Presented as the two visual-level obstacles exacerbating the inductive gap

pith-pipeline@v0.9.0 · 5551 in / 1299 out tokens · 59748 ms · 2026-05-08T19:15:13.089180+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

81 extracted references · 29 canonical work pages · 12 internal anchors

  1. [1]

    Anthropic Claude

    Anthropic. Anthropic Claude. https://claude.ai/, 2026

  2. [2]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    What makes multimodal in-context learning work? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024

    Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, and Benjamin Piwowarski. What makes multimodal in-context learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024

  6. [6]

    The role of deductive and inductive reasoning in large language models

    Chengkun Cai, Xu Zhao, Haoliang Liu, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, and Lei Li. The role of deductive and inductive reasoning in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16780–16790, 2025

  7. [7]

    Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models.arXiv preprint arXiv:2502.00698, 2025

    Huanqia Cai, Yijun Yang, and Winston Hu. Mm-iq: Benchmarking human-like abstraction and reasoning in multimodal models.arXiv preprint arXiv:2502.00698, 2025

  8. [8]

    Can multimodal large language models truly perform multimodal in-context learning? In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6000–6010

    Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, and Jindong Gu. Can multimodal large language models truly perform multimodal in-context learning? In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6000–6010. IEEE, 2025

  9. [9]

    True multimodal in-context learning needs attention to the visual context.arXiv preprint arXiv:2507.15807, 2025

    Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. True multimodal in-context learning needs attention to the visual context. arXiv preprint arXiv:2507.15807, 2025

  10. [10]

    Inductive or deductive? rethinking the fundamental reasoning abilities of llms.arXiv preprint arXiv:2408.00114, 2024

    Kewei Cheng, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Binxuan Huang, Ruirui Li, Shiyang Li, Zheng Li, Yifan Gao, Xian Li, et al. Inductive or deductive? rethinking the fundamental reasoning abilities of llms.arXiv preprint arXiv:2408.00114, 2024

  11. [11]

    A survey on in-context learning

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 1107–1128, 2024

  12. [12]

    Towards multimodal in-context learning for vision and language models

    Sivan Doveh, Shaked Perek, M Jehanzeb Mirza, Wei Lin, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. Towards multimodal in-context learning for vision and language models. InEuropean Conference on Computer Vision, pages 250–267. Springer, 2024

  13. [13]

    Semeval-2022 task 5: Multimedia automatic misogyny identification

    Elisabetta Fersini, Francesca Gasparini, Giulia Rizzi, Aurora Saibene, Berta Chulvi, Paolo Rosso, Alyssa Lees, and Jeffrey Sorensen. Semeval-2022 task 5: Multimedia automatic misogyny identification. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 533–549, 2022

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  15. [15]

    Deepeyesv2: Toward agentic multimodal model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

  16. [16]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  17. [17]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  18. [18]

    Csvqa: A chinese multimodal benchmark for evaluating stem reasoning capabilities of vlms.arXiv preprint arXiv:2505.24120, 2025

    Ai Jian, Weijie Qiu, Xiaokun Wang, Peiyu Wang, Yunzhuo Hao, Jiangbo Pei, Yichen Wei, Yi Peng, and Xuchen Song. Csvqa: A chinese multimodal benchmark for evaluating stem reasoning capabilities of vlms.arXiv preprint arXiv:2505.24120, 2025

  19. [19]

    Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025

  20. [20]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  21. [21]

    M2IV:: Towards efficient and fine-grained multimodal in-context learning via representation engineering.arXiv preprint arXiv:2504.04633, 2025

    Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, and Ruixiang Tang. M2IV:: Towards efficient and fine-grained multimodal in-context learning via representation engineering.arXiv preprint arXiv:2504.04633, 2025

  22. [22]

    Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning

    Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, and Ruixiang Tang. Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6619–6627, 2026

  23. [23]

    Make lvlms focus: Context-aware attention modulation for better multimodal in-context learning

    Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, et al. Make lvlms focus: Context-aware attention modulation for better multimodal in-context learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6610–6618, 2026

  24. [24]

    Implicit in-context learning.arXiv preprint arXiv:2405.14660, 2024

    Zhuowei Li, Zihao Xu, Ligong Han, Yunhe Gao, Song Wen, Di Liu, Hao Wang, and Dimitris N Metaxas. Implicit in-context learning.arXiv preprint arXiv:2405.14660, 2024

  25. [25]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  26. [26]

    Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection

    Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 103...

  27. [27]

    Rethinking the role of demonstrations: What makes in-context learning work?

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

  28. [28]

    A new era of intelligence with gemini

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini

  29. [29]

    https://blog.google/products-and-platforms/products/gemini/gemini-3/, 2025

  30. [30]

    Detecting harmful memes and their targets

    Shraman Pramanick, Dimitar Dimitrov, Rituparna Mukherjee, Shivam Sharma, Md Shad Akhtar, Preslav Nakov, and Tanmoy Chakraborty. Detecting harmful memes and their targets. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pages 2783–2796, 2021

  31. [31]

    Seed1.8 model card: Towards generalized real-world agency

    Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. arXiv preprint arXiv:2603.20633, 2026

  32. [32]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  33. [33]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  34. [34]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019

  35. [35]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

  36. [36]

    Meta-analysis of cohen’s kappa.Health Services and Outcomes Research Methodology, 11(3):145–163, 2011

    Shuyan Sun. Meta-analysis of cohen’s kappa.Health Services and Outcomes Research Methodology, 11(3):145–163, 2011

  37. [37]

    Kimi K2.5: Visual agentic intelligence

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  38. [38]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan Pravin More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Ilmuz Zaman Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. InFindings of the Association for Computational Linguistics: ACL 2025, pages 24290–24315, 2025

  39. [39]

    Identifying and mitigating position bias of multi-image vision-language models

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Identifying and mitigating position bias of multi-image vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10599–10609, 2025

  40. [40]

    Mmifevol: Towards evolutionary multimodal instruction following

    Haoyu Wang, Sihang Jiang, Xiangru Zhu, Yuyan Chen, Xiaojun Meng, Jiansheng Wei, Yitong Wang, and Yanghua Xiao. Mmifevol: Towards evolutionary multimodal instruction following. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 26206–26214, 2026

  41. [41]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  42. [42]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners.arXiv preprint arXiv:2109.01652, 2021

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  44. [44]

    The learnability of in-context learning

    Noam Wies, Yoav Levine, and Amnon Shashua. The learnability of in-context learning. Advances in Neural Information Processing Systems, 36:36637–36651, 2023

  45. [45]

    Mmsearch-r1: Incentivizing lmms to search.arXiv preprint arXiv:2506.20670, 2025

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. arXiv preprint arXiv:2506.20670, 2025

  46. [46]

    Improve multi-modal embedding learning via explicit hard negative gradient amplifying.arXiv preprint arXiv:2506.02020, 2025

    Youze Xue, Dian Li, and Gang Liu. Improve multi-modal embedding learning via explicit hard negative gradient amplifying.arXiv preprint arXiv:2506.02020, 2025

  47. [47]

    R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2376–2385, 2025

  48. [48]

    Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy

    Zhibo Yang, Jun Tang, Zhaohai Li, Pengfei Wang, Jianqiang Wan, Humen Zhong, Xuejing Liu, Mingkun Yang, Peng Wang, Shuai Bai, et al. Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21744–21754, 2025

  49. [49]

    Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

    Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, and Jiaxing Huang. Mm-deepresearch: A simple and effective multimodal agentic search baseline.arXiv preprint arXiv:2603.01050, 2026

  50. [50]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  51. [51]

    Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

  52. [52]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  53. [53]

    R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning

    Jiaxing Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning.arXiv preprint arXiv:2503.05379, 2025

  54. [54]

    Swift: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025

  55. [55]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

  56. [56]

    Mdk12-bench: a multi-discipline benchmark for evaluating reasoning in multimodal large language models

    Pengfei Zhou, Xiaopeng Peng, Fanrui Zhang, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Wangbo Zhao, Jiajun Song, Chuanhao Li, Weidong Tang, et al. Mdk12-bench: a multi-discipline benchmark for evaluating reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28982–28990, 2026

  57. [57]

    Don’t just chase “highlighted tokens” in MLLMs: Revisiting visual holistic context retention

    Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, and Xuming Hu. Don’t just chase “highlighted tokens” in MLLMs: Revisiting visual holistic context retention. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
