pith. machine review for the scientific record.

arxiv: 2603.04415 · v2 · submitted 2026-02-04 · 💻 cs.CL · cs.CV

Recognition: 1 theorem link · Lean Theorem

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training


Pith reviewed 2026-05-16 08:09 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords Dual Tuning · reasoning efficacy · data curation · multimodal LLMs · Chain-of-Thought · post-training strategies · CoT data quality · spatial and mathematical tasks

The pith

Dual Tuning jointly evaluates data benefit and reasoning gains to curate training sets for multimodal LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Dual Tuning as a framework that, for a given base model and target task, checks both whether a dataset improves overall performance and whether Chain-of-Thought reasoning training adds measurable gains over direct-answer training. This matters because the current practice of releasing separate instruct and thinking models consumes extra resources without clear rules for when each approach helps. The method is tested on spatial, mathematical, and multi-disciplinary multimodal tasks, and it further examines how reinforcement learning and thinking patterns influence the gains. It produces a concrete label for each dataset: suitable for reasoning training, better suited to direct answers, or harmful under both modes. The result is a set of explicit numerical criteria for choosing data and matching post-training strategies.

Core claim

Dual Tuning is a reasoning efficacy-driven data curation framework that, given a target task and a base model, jointly evaluates whether the training data is beneficial and whether reasoning training with current CoT content yields positive gains over non-reasoning alternatives, thereby providing quantitative criteria for selecting appropriate training data and matching post-training strategies.

What carries the argument

Dual Tuning, the joint evaluation of data benefit and reasoning gain performed directly from the base model and task.

If this is right

  • Guides data curation by identifying data that benefit reasoning training, data better suited to direct-answer training, and data detrimental under both modes.
  • Supplies quantitative criteria for deciding when reasoning post-training is beneficial on multimodal tasks.
  • Applies the criteria across spatial, mathematical, and multi-disciplinary tasks.
  • Reveals effects of reinforcement learning and thinking patterns on reasoning efficacy.
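The three-way labeling described above can be sketched as a simple decision rule over per-dataset training gains. This is an illustration only: the function name, the `eps` threshold, and the tie-breaking rule are assumptions, not the paper's exact criteria.

```python
def label_dataset(gain_cot: float, gain_da: float, eps: float = 0.0) -> str:
    """Toy partition of a dataset by its measured training gains.

    gain_cot: accuracy gain from Chain-of-Thought (reasoning) training
    gain_da:  accuracy gain from direct-answer training
    eps:      minimum gain counted as a real improvement (assumed threshold)
    """
    if gain_cot <= eps and gain_da <= eps:
        return "detrimental"            # harmful under both training modes
    if gain_cot > gain_da:
        return "reasoning-beneficial"   # CoT training wins
    return "direct-answer-suited"       # DA training wins or ties

# Example: a dataset where CoT training adds +4.1 points but DA only +1.2
print(label_dataset(4.1, 1.2))    # reasoning-beneficial
print(label_dataset(-0.5, -1.0))  # detrimental
```

The two comparisons mirror the paper's two-dimensional GainCoT/GainDA map: the first test carves off the lower-left negative region, the second splits the remainder by which training mode dominates.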

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Could reduce the need to maintain parallel instruct and thinking model releases by tailoring strategies per dataset.
  • May generalize to curating data for other training regimes where multiple post-training modes compete.
  • Supports automated pipelines that decide training mode before large-scale runs, lowering wasted compute on mismatched data.

Load-bearing premise

The joint evaluation of data benefit and reasoning gain can be performed reliably from the base model and task without introducing selection bias or requiring post-hoc adjustments that affect the final curation decisions.

What would settle it

An experiment showing that datasets labeled beneficial for reasoning by Dual Tuning produce no gain or negative gain when used for reasoning training compared with direct-answer training on the same data, across multiple tasks.

Figures

Figures reproduced from arXiv: 2603.04415 by Jianing Li, Jingdong Chen, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan.

Figure 1. The base model shows discrepancies in initial performance between CoT and DA inference across various tasks.
Figure 2. Each task's GainCoT and GainDA plotted in a two-dimensional coordinate map; three distinct regions categorize the suitability of tasks for the two training modes.
Figure 3. Evaluation on two datasets, marked by circles (original) and triangles (new), on MMMU; the resulting change in task distribution highlights how thinking patterns dictate reasoning suitability across tasks.
Figure 4. The effectiveness of a thinking pattern depends on its refinement and the exclusion of redundant or invalid reasoning; Gain_token is compared for both datasets on MathVista tasks.
Figure 5. We partition tasks into two halves using …
Figure 6. We partition tasks into two halves using …
Figure 7. Models trained separately with data from the lower-left negative region and from the remaining three positive regions.
read the original abstract

Reasoning post-training improves Large Language Models (LLMs) on complex tasks such as mathematics and coding, but its benefits across diverse multimodal tasks remain uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading teams is both resource-intensive and user-unfriendly. Prior work finds that the gains from reasoning training are influenced by multiple factors, such as base model capabilities, task characteristics, and Chain-of-Thought (CoT) data quality. However, principled criteria for determining when reasoning post-training is beneficial and which data should support it are still lacking. In this paper, we propose Dual Tuning, a reasoning efficacy-driven data curation framework for multimodal LLM training. Given a target task and a base model, Dual Tuning jointly evaluates whether the training data is beneficial and whether reasoning training with current CoT content yields positive gains over non-reasoning alternatives. We apply Dual Tuning across spatial, mathematical, and multi-disciplinary tasks, and further analyze how reinforcement learning and thinking patterns affect reasoning efficacy. The Dual Tuning results guide data curation by identifying data that benefit reasoning training, data better suited to direct-answer training, and data that are detrimental under both training modes. Our work provides quantitative criteria for selecting appropriate training data and matching post-training strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dual Tuning, a reasoning efficacy-driven data curation framework for multimodal LLM training. Given a base model and target task, it jointly evaluates whether training data is beneficial and whether reasoning training using current CoT content produces positive gains over non-reasoning alternatives, then partitions the data into reasoning-beneficial, direct-answer suitable, or detrimental categories. The framework is applied to spatial, mathematical, and multi-disciplinary tasks, with additional analysis of reinforcement learning effects and thinking patterns to guide post-training strategy selection.

Significance. If empirically validated with clear quantitative support, Dual Tuning could supply practical, task-specific criteria for deciding when reasoning post-training is worthwhile in multimodal settings, potentially reducing the resource cost of indiscriminately training parallel Instruct and Thinking models.

major comments (2)
  1. Abstract: The description of Dual Tuning supplies the intended procedure and output categories but contains no quantitative results, error bars, or validation protocol, rendering it impossible to assess whether the joint evaluations actually support the claimed curation decisions.
  2. The weakest assumption section (implicit in the framework definition): The claim that joint evaluation of data benefit and reasoning gain can be performed reliably from the base model without introducing selection bias or requiring post-hoc adjustments is load-bearing for the partitioning results, yet no concrete measurement protocol, threshold derivation, or bias-mitigation steps are detailed.
minor comments (1)
  1. Clarify how 'thinking patterns' are operationalized and quantified in the RL analysis section, including any metrics or examples used to link them to efficacy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications on the framework and updating the presentation of results where appropriate.

read point-by-point responses
  1. Referee: Abstract: The description of Dual Tuning supplies the intended procedure and output categories but contains no quantitative results, error bars, or validation protocol, rendering it impossible to assess whether the joint evaluations actually support the claimed curation decisions.

    Authors: We agree that the abstract would be strengthened by including quantitative indicators. In the revised manuscript we have updated the abstract to report key empirical outcomes, including average accuracy gains on the three task categories, the fraction of data assigned to each partition, and a brief statement of the validation protocol based on held-out comparative evaluations. revision: yes

  2. Referee: The weakest assumption section (implicit in the framework definition): The claim that joint evaluation of data benefit and reasoning gain can be performed reliably from the base model without introducing selection bias or requiring post-hoc adjustments is load-bearing for the partitioning results, yet no concrete measurement protocol, threshold derivation, or bias-mitigation steps are detailed.

    Authors: The Dual Tuning procedure computes benefit and reasoning-gain scores by comparing the base model’s accuracy on a fixed validation split under direct-answer versus CoT-augmented inference. Thresholds are obtained from the empirical distribution of these scores via bootstrap resampling with a p < 0.05 significance cutoff. Bias mitigation is performed through stratified sampling of the validation set and repeated runs with different random seeds. We have added a new subsection in the Methods that spells out the exact scoring equations, the threshold derivation procedure, and the stratification steps. The framework deliberately avoids post-training adjustments so that the partitioning remains predictive from the base model alone; we have also inserted a short limitations paragraph acknowledging residual selection effects. revision: yes
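The scoring-and-thresholding step the rebuttal describes can be sketched as follows. This is a hedged illustration, assuming per-example CoT-minus-DA correctness differences and a percentile bootstrap; the paper's exact resampling scheme and significance test are not specified here.

```python
import random

def bootstrap_gain_significant(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Test whether the mean CoT-vs-DA gain is positive at level alpha.

    diffs: per-example correctness differences (CoT score - DA score)
           on a fixed validation split
    Returns (mean_gain, significant), where significance means the lower
    alpha/2 percentile of bootstrap resample means lies above zero.
    """
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    n = len(diffs)
    # Resample the validation split with replacement n_boot times
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]  # lower percentile of the interval
    mean_gain = sum(diffs) / n
    return mean_gain, lo > 0.0

# Example: mostly positive per-example gains
gain, sig = bootstrap_gain_significant([1, 1, 0, 1, -1, 1, 1, 0, 1, 1])
```

Stratified sampling and repeated seeds, as the rebuttal proposes, would wrap this routine rather than change it: the bootstrap only certifies that a measured gain is unlikely to be resampling noise.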

Circularity Check

0 steps flagged

No significant circularity

full rationale

The Dual Tuning framework evaluates data benefit and reasoning gains directly from the base model on the target task, then partitions data into categories based on those evaluation outputs. The claimed quantitative criteria for curation are therefore the direct results of the described procedure rather than any fitted parameter, self-defined quantity, or self-citation chain. No equations, ansatzes, or load-bearing prior results from the same authors are invoked in the abstract or summary to force the outcome. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5537 in / 1166 out tokens · 25742 ms · 2026-05-16T08:09:54.265144+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 16 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  2. [2]

    InternLM2 Technical Report, 2024

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, S...

  3. [3]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  4. [4]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.

  5. [5]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, J...

  6. [6]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  7. [7]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shix...

  8. [8]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo ...

  9. [9]

    CVBench: Benchmarking cross-video synergies for complex multimodal reasoning

    Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, et al. Cvbench: Benchmarking cross-video synergies for complex multimodal reasoning. arXiv preprint arXiv:2508.19542, 2025

  10. [10]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

  11. [11]

    Kat-v1: Kwai-autothink technical report, 2025

    Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotia...

  12. [12]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.

  13. [13]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3DSRBench: A comprehensive 3D spatial reasoning benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025.

  14. [14]

    M2-Reasoning: Empowering MLLMs with unified general and spatial reasoning

    Inclusion AI, Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, et al. M2-Reasoning: Empowering MLLMs with unified general and spatial reasoning. arXiv preprint arXiv:2507.08306, 2025.

  15. [15]

    OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. OmniSpatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135, 2025.

  16. [16]

    Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma Gongque, Shanglin Lei, Yifan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages...

  17. [17]

    Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, and Hongxia Yang. Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning. arXiv preprint arXiv:2401.06805, 2024.

  18. [18]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.

  19. [19]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

  20. [20]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

  21. [21]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  22. [22]

    Ming-omni: A unified multimodal model for perception and generation, 2025

    InclusionAI. Ming-Omni: A unified multimodal model for perception and generation, 2025. URL https://arxiv.org/abs/2506.09344

  23. [23]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10912–10922, 2021.

  24. [24]

    ScanNet: Richly-annotated 3D reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

  25. [25]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.

  26. [26]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. arXiv preprint arXiv:2111.08897, 2021.

  27. [27]

    SAT: Dynamic spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, et al. SAT: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.

  28. [28]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  29. [29]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. OneThinker: All-in-one reasoning model for image and video. arXiv preprint arXiv:2512.03043, 2025.

  30. [30]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, 2025.

  31. [31]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.