Recognition: 2 theorem links
· Lean TheoremBoosting Reasoning in Large Multimodal Models via Activation Replay
Pith reviewed 2026-05-17 05:02 UTC · model grok-4.3
The pith
Replaying low-entropy activations from base models to RLVR-trained multimodal models improves their reasoning at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reinforcement Learning with Verifiable Rewards shifts low-entropy activations in large multimodal models in a manner associated with reasoning performance. Activation Replay restores these activations by replaying them from the input context of the base model into the RLVR model through targeted manipulation of visual tokens at test time, resulting in stronger reasoning across mathematics, o3-style visual agents, and video tasks while increasing Pass@K and widening reasoning coverage.
What carries the argument
Activation Replay, a training-free method that manipulates visual tokens at test time to replay low-entropy activations from the input context of base large multimodal models into their RLVR-trained counterparts.
If this is right
- Reasoning accuracy rises on mathematics problems when low-entropy activations are replayed from the base model.
- Performance improves on o3-like visual agent tasks under the same replay procedure.
- Video reasoning quality increases after the test-time token manipulation.
- Pass@K metrics become higher, indicating more successful solution paths are found.
- The narrower set of reasoning routes typical after RLVR is broadened by the replay step.
Where Pith is reading between the lines
- Base models appear to keep activation patterns that support wider reasoning even after RLVR has narrowed them, suggesting the original patterns carry recoverable value.
- Similar replay of specific activation ranges could be tested on other post-training techniques that alter model behavior in comparable ways.
- If the same low-entropy shift occurs in language-only models, the replay approach might transfer without visual tokens.
- The method opens a route to hybrid systems that combine RLVR for capability gains with periodic restoration of base-model patterns to maintain diversity.
Load-bearing premise
The shift in low-entropy activations produced by RLVR training is causally helpful for reasoning, so that restoring the original low-entropy activations from the base model will produce consistent gains.
What would settle it
A direct comparison on the same reasoning benchmarks showing that models using Activation Replay achieve no higher accuracy or Pass@K than the RLVR models alone, or that swapping in high-entropy activations instead produces equal or better results.
Figures
read the original abstract
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code is publicly available at https://github.com/latentcraft/replay.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines how Reinforcement Learning with Verifiable Rewards (RLVR) alters input activations in Large Multimodal Models (LMMs) via logit-lens analysis, observing that RLVR disproportionately shifts low-entropy activations while leaving high-entropy ones relatively stable. It links this shift to reasoning performance through controlled experiments and proposes Activation Replay, a training-free test-time method that replays low-entropy activations from base LMMs into RLVR models by manipulating visual tokens. The approach is claimed to improve reasoning on mathematics, o3-like visual agents, and video tasks, increase Pass@K, and broaden reasoning coverage compared to RLVR alone, with ablations showing superiority over high-entropy replay or direct cross-model interventions.
Significance. If the empirical findings hold, the work provides a lightweight, training-free intervention for enhancing multimodal reasoning in post-trained LMMs and offers mechanistic insight into RLVR effects on activations. Public code release and systematic comparisons to alternatives are positive elements that support reproducibility and allow direct evaluation of the method.
major comments (2)
- [Abstract and Experiments] The central causal claim—that replaying specifically the low-entropy activations from base models improves reasoning—rests on indirect comparisons (high-entropy replay vs. low-entropy replay, and direct cross-model intervention). However, because the intervention is implemented via manipulation of visual tokens at test time, it is unclear whether gains arise from the entropy-selected values themselves or from ancillary changes in token statistics, diversity, or attention patterns. A minimal intervention that isolates only the low-entropy property while holding other input statistics fixed would be required to substantiate the mechanism.
- [Abstract] The manuscript reports systematic logit-lens observations and controlled experiments linking low-entropy shifts to reasoning, yet the abstract and available description lack quantitative details such as exact effect sizes, error bars, number of runs, or full methodology for the activation measurements and replay procedure. This limits verification of the reported associations and performance gains.
minor comments (2)
- Clarify the precise definition and selection criterion for 'low-entropy activations' (e.g., entropy threshold, layer, or token position) to enable exact reproduction.
- Include standard deviation or confidence intervals for all reported metrics (Pass@K, accuracy) across the diverse scenarios.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, clarifying our experimental design and committing to revisions that strengthen the presentation of quantitative results and causal evidence.
read point-by-point responses
-
Referee: [Abstract and Experiments] The central causal claim—that replaying specifically the low-entropy activations from base models improves reasoning—rests on indirect comparisons (high-entropy replay vs. low-entropy replay, and direct cross-model intervention). However, because the intervention is implemented via manipulation of visual tokens at test time, it is unclear whether gains arise from the entropy-selected values themselves or from ancillary changes in token statistics, diversity, or attention patterns. A minimal intervention that isolates only the low-entropy property while holding other input statistics fixed would be required to substantiate the mechanism.
Authors: We agree that tighter isolation of the low-entropy property would further strengthen the causal interpretation. Our existing controls compare low-entropy versus high-entropy replay under identical token-manipulation mechanics (same number of tokens, same visual-token replacement procedure), which already holds many ancillary statistics constant while varying only the entropy criterion. The direct cross-model baseline provides an orthogonal control without token manipulation. To address the concern more directly, we will add in revision a new ablation that selects replay candidates matched on token norm, diversity, and attention contribution but differing in entropy; this will help isolate whether performance gains track the low-entropy property specifically. We view this as a valuable addition rather than a fundamental flaw in the current design. revision: yes
-
Referee: [Abstract] The manuscript reports systematic logit-lens observations and controlled experiments linking low-entropy shifts to reasoning, yet the abstract and available description lack quantitative details such as exact effect sizes, error bars, number of runs, or full methodology for the activation measurements and replay procedure. This limits verification of the reported associations and performance gains.
Authors: We accept this observation. The revised abstract will now report concrete effect sizes (e.g., average accuracy gains of 4.8–12.3 percentage points across math, visual-agent, and video benchmarks), standard deviations from three independent runs, and explicit error bars on all main figures. We have also expanded the methods section and appendix with the precise logit-lens implementation (layer indices, projection details), the exact replay procedure (token selection threshold, replacement rule), and the number of runs used for all reported metrics. These additions will make the quantitative claims directly verifiable. revision: yes
Circularity Check
No circularity: empirical test-time intervention supported by direct experiments
full rationale
The paper reports logit-lens observations of RLVR-induced activation shifts, controlled experiments linking low-entropy activations to reasoning performance, and ablations comparing low-entropy replay against high-entropy replay and cross-model baselines. No equations, derivations, or first-principles claims appear; the method is a training-free manipulation of visual tokens at inference time whose gains are measured against explicit alternative interventions. All load-bearing evidence consists of external empirical comparisons rather than self-definitional reductions, fitted parameters renamed as predictions, or self-citation chains. The derivation chain is therefore self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption RLVR shifts low-entropy activations in a manner associated with improved LMM reasoning capability
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs... replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
-
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Claude 3.5 sonnet.https : / / www
Anthropic. Claude 3.5 sonnet.https : / / www . anthropic.com/claude, 2024. Large language model. 6
work page 2024
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R ´e, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025
Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, and Cihang Xie. Sft or rl? an early investigation into training r1-like reasoning large vision-language models, 2025. 1
work page 2025
-
[7]
Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From opti- mized cold start to staged reinforcement learning.arXiv preprint arXiv:2506.04207, 2025. 1, 8
-
[8]
Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025
Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 1, 8
-
[9]
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 1, 6
work page internal anchor Pith review arXiv 2025
-
[10]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video- llms.arXiv preprint arXiv:2406.07476, 2024. 1, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 1, 8
work page 2023
-
[12]
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024. 3
work page 2024
-
[13]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos.arXiv preprint arXiv:2501.13826, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Interpreting and editing vision-language representations to mitigate hallucinations
Nicholas Jiang, Anish Kachinthaya, Suzanne Petryk, and Yossi Gandelsman. Interpreting and editing vision-language representations to mitigate hallucinations. InThe Thir- teenth International Conference on Learning Representa- tions, 2025. 8
work page 2025
-
[18]
Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 1, 6
-
[19]
Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, et al. Mmr1: Enhancing multimodal reasoning with variance-aware sampling and open resources.arXiv preprint arXiv:2509.21268, 2025. 6, 8, 1
-
[20]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 8
work page 2023
-
[23]
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and sym- bolic reasoning.arXiv preprint arXiv:2105.04165, 2021. 8
-
[24]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 2, 6, 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Ruilin Luo, Zhuofan Zheng, Yifan Wang, Yiyao Yu, Xinzhe Ni, Zicheng Lin, Jin Zeng, and Yujiu Yang. Ursa: Under- standing and verifying chain-of-thought reasoning in multi- modal mathematics.arXiv preprint arXiv:2501.04686, 2025. 8
-
[26]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 8 9
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[27]
A Survey of Context Engineering for Large Language Models
Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, et al. A survey of context engineering for large lan- guage models.arXiv preprint arXiv:2507.13334, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Jun- jun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learn- ing.CoRR, 2025. 1, 3, 4, 6, 7, 8
work page 2025
-
[29]
Interpreting gpt: The logit lens, 2020
Neel Nanda. Interpreting gpt: The logit lens, 2020. Ac- cessed: December 1, 2025. 2, 3, 8
work page 2020
-
[30]
Towards interpreting visual in- formation processing in vision-language models
Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual in- formation processing in vision-language models. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 2, 3, 8
work page 2025
-
[31]
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large mul- timodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. 1, 2, 4, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 8
work page 2021
-
[34]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoom- eye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. InPro- ceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6613–6629, 2025. 1, 6
work page 2025
-
[37]
Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, and Chao Feng. Sail-rl: Guiding mllms in when and how to think via dual-reward rl tuning.arXiv preprint arXiv:2511.02280,
-
[38]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chen- zhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Hao- tian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, H...
work page 2025
-
[40]
Qvq: To see the world with wisdom, 2024
Qwen Team. Qvq: To see the world with wisdom, 2024. 6
work page 2024
-
[41]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning.arXiv preprint arXiv:2506.01713, 2025. 2, 8
-
[44]
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning.arXiv preprint arXiv:2504.08837, 2025. 1, 2, 4, 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Mea- suring multimodal mathematical reasoning with math-vision dataset.Advances in Neural Information Processing Sys- tems, 37:95095–95169, 2024. 1, 2, 4, 6
work page 2024
-
[46]
Peiyu Wang, Yichen Wei, Yi Peng, Xiaokun Wang, Wei- jie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, et al. Skywork r1v2: Multimodal hybrid reinforcement learning for reasoning.arXiv preprint arXiv:2504.16656, 2025. 7
-
[47]
Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Run- hao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking” thinking with videos” via reinforcement learning.arXiv preprint arXiv:2510.23473,
-
[48]
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shix- uan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939, 2025. 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[49]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution im- age perception in multimodal large language models.arXiv preprint, 2024. 2, 6, 1
work page 2024
-
[50]
V?: Guided visual search as a core mechanism in multimodal llms
Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 8
work page 2024
-
[51]
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Log- icvista: Multimodal llm logical reasoning benchmark in vi- sual contexts.arXiv preprint arXiv:2407.04973, 2024. 1, 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli´c. Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025. 8
-
[53]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 6, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gao- hong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025. 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556– 9567, 2024. 6, 7
work page 2024
-
[56]
Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186,
-
[57]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837, 2025. 2, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 8
work page 2023
-
[59]
Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learn- ing to reason with multimodal large language models via step-wise group relative policy optimization.arXiv preprint arXiv:2503.12937, 2025. 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024. 2, 4, 6, 7, 1
work page 2024
-
[61]
Mmvu: Measuring expert-level multi- discipline video understanding
Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi- discipline video understanding. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 8475–8489, 2025. 6
work page 2025
-
[62]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 1, 6, 7, 8, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi- task long video understanding.arXiv e-prints, pages arXiv– 2406, 2024. 6
work page 2024
-
[65]
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 1, 8
-
[66]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 1, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[67]
Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 8
-
[68]
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual bench- mark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836,
-
[69]
Xin Zou, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Ken- ing Zheng, Sirui Huang, Junkai Chen, Peijie Jiang, Jia Liu, Chang Tang, et al. Look twice before you answer: Memory-space visual retracing for hallucination mitiga- tion in multimodal large language models.arXiv preprint arXiv:2410.03577, 2024. 8 11 Boosting Reasoning in Large Multimodal Models via Activ...
-
[71]
- The green circle ( ) next to the sequence indicate that we need to apply one of the rules
**Analyzing the given sequence:** - The sequence starts with: $ \square \square \star $. - The green circle ( ) next to the sequence indicate that we need to apply one of the rules
-
[74]
**Matching with the options:** - None of the options directly match $ \star \star \star $, but we need to re-evaluate the rules and the sequence again
-
[75]
Changes shading on 1st and 3rd symbol
**Re-evaluting the rules:** - The green circle ( ) might suggest a different rule. Let's recheck: - The green circle ( ) could also suggest **Rule 2:** "Changes shading on 1st and 3rd symbol." - According to Rule 2, we need to change the shading on the 1st and 3rd symbol
-
[76]
- Change shading on 1st and 3rd symbol: $ \star \square \star $
**Applying Rule 2:** - The original sequence is: $ \square \square \star $. - Change shading on 1st and 3rd symbol: $ \star \square \star $
-
[77]
<answer> The answer is $ \boxed{C} $
**Matching with the options:** - The sequence $ \star \square \star $ matches with option **C**. <answer> The answer is $ \boxed{C} $. </answer> To solve this problem, we need to apply the rules given in the top section of the image to the sequence in the bottom section. Let's break it down step by step:
-
[78]
- **Rule 2:**Changes shading on 1st and 3rd symbol
**Understanding the rules:** - **Rule 1:** Each symbol moves one place to the right. - **Rule 2:**Changes shading on 1st and 3rd symbol. - **Rule 3:**Reflects all symbols in X axis. - **Rule 4:**Replaced all squares with stars
-
[79]
- The green circle ( ) indicates that we need to apply one of the rules
**Analyzing the given sequence:** - The sequence starts with: $ \square \square \star $. - The green circle ( ) indicates that we need to apply one of the rules
-
[80]
**Applying the rules:** - The green circle ( ) suggests we should apply **Rule 4:** "Replaced alls with stars." - According to Rule 4, we need to replace all square ( $ \square $ ) with stars ( $ \star $)
-
[81]
- Replaced alls with stars: $ \star \star \star $
**Applying Rule 4:** - The original sequence is: $ \square \square \star $. - Replaced alls with stars: $ \star \star \star $
-
[82]
**Matching with the options:** - None of the options directly match $ \star \star \star $, but we need to consider the transformation rules again. - Let's re-evaluate the rules and the sequence: - The sequence $ \square \square \star $ might need to be transformed using Rule 1
-
[83]
- Moving each symbol one place to the right: $ \star \square \square $
**Applying Rule 1:** - Rule 1: Each symbol moves one place to the right. - Moving each symbol one place to the right: $ \star \square \square $
-
[84]
- B: $ \star \square \blacksquare $
**Re-evaluting the options:** - The options are: - A: $ \star \square \square $. - B: $ \star \square \blacksquare $. - C: $ \square \star \star $. - D: $ \square \star \square $
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.