MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
Pith reviewed 2026-05-09 14:24 UTC · model grok-4.3
The pith
MIRL uses mutual information between visual descriptions and images to pre-filter promising reasoning trajectories in vision-language models, raising accuracy while cutting the number of full samples needed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRL is a decoupled reinforcement-learning framework that uses mutual information between generated visual descriptions and input images both as an early filter to allocate sampling budget via trajectory forking and as an independent reward signal that allows separate optimization of the visual-perception and reasoning stages.
What carries the argument
Mutual information between generated descriptions and visual inputs, used both for early trajectory forking and for decoupled perception/reasoning rewards.
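The paper does not state which MI estimator it uses (a point raised in the minor comments below). As one plausible reading, here is a minimal InfoNCE-style lower bound on the MI between paired description and image embeddings, a standard contrastive proxy; the function name, the temperature value, and the batched-pairs setup are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def infonce_mi_lower_bound(desc_emb: np.ndarray, img_emb: np.ndarray,
                           temperature: float = 0.07) -> float:
    """InfoNCE lower bound on I(description; image) over a batch of N
    paired embeddings. A hypothetical proxy for the paper's MI signal."""
    # L2-normalise rows so the dot product is a cosine similarity.
    desc = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = desc @ img.T / temperature             # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    # Row-wise log-softmax: log p(matching image | description).
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = desc.shape[0]
    # I(d; v) >= log N + E[log p(true pair)]  (van den Oord et al., 2018)
    return float(np.log(n) + np.diag(log_p).mean())
```

For a single trajectory, the pre-screening score would reduce to the log-probability of that description's true image under the same softmax; the batch bound above only conveys the estimator's general shape.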
If this is right
- Sampling budget is redirected away from trajectories doomed by early visual errors.
- Visual perception can be trained with its own dense signal instead of waiting for end-of-trajectory answer correctness.
- Reasoning optimization is no longer blinded by perception failures.
- The same pre-sample and top-k selection procedure reaches accuracy that previously required sampling 25 percent more complete trajectories (see the sketch after this list).
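A minimal sketch of that pre-sample-and-fork loop, assuming hypothetical helpers sample_description, mi_score, and complete_reasoning in place of the paper's actual implementation:

```python
# Sketch of MI-guided budget allocation. sample_description, mi_score,
# and complete_reasoning are hypothetical placeholders, not the paper's API.
def mirl_sample(image, question, n_presamples: int = 10, top_k: int = 6):
    # Stage 1: cheap pre-samples -- visual descriptions only, no reasoning.
    descriptions = [sample_description(image, question)
                    for _ in range(n_presamples)]
    # Rank by the MI pre-screening signal (higher = better grounded).
    ranked = sorted(descriptions, key=lambda d: mi_score(d, image),
                    reverse=True)
    # Stage 2: fork only the top-k descriptions into full reasoning
    # trajectories, so the expensive budget avoids doomed starts.
    return [complete_reasoning(image, question, d) for d in ranked[:top_k]]
```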
Where Pith is reading between the lines
- The approach could be tested on tasks where perception errors are rarer but reasoning depth is greater, to see whether the MI signal remains useful.
- If the MI pre-screen correlates with human judgments of description quality, it might serve as a lightweight proxy for human preference data in multimodal RL.
- Decoupling the two stages suggests that perception modules could be swapped or fine-tuned without retraining the entire reasoning policy.
Load-bearing premise
Mutual information between a generated description and the visual input reliably indicates which trajectories will produce correct final answers after full reasoning.
What would settle it
An experiment on one of the six benchmarks in which trajectories ranked high by mutual information frequently produce wrong final answers while low-MI trajectories succeed.
Original abstract
Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and successfully surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MIRL, a decoupled reinforcement learning framework for vision-language models that uses mutual information (MI) between generated visual descriptions and inputs as a pre-screening signal. This enables forking of high-potential trajectories to allocate sampling budget efficiently and provides independent MI-based rewards to optimize visual perception separately from reasoning, addressing wasted computation on failing trajectories and sparse rewards in standard RLVR. Experiments across six vision-language reasoning benchmarks report an average accuracy of 70.22%, with the method surpassing the accuracy of sampling 16 complete trajectories while using only 10 pre-samples followed by top-6 MI selection (25% fewer complete trajectories). Code is provided at an anonymous repository.
Significance. If the MI pre-screening reliably identifies viable trajectories, MIRL could meaningfully advance sample-efficient RLVR for VLMs by mitigating early visual perception failures and enabling stage-specific optimization. The availability of code supports reproducibility and allows independent verification of the efficiency claims.
major comments (2)
- [Experiments] The central efficiency claim (surpassing 16 full trajectories via 10 pre-samples + top-6 MI selection) is load-bearing for the paper's contribution, yet the manuscript provides no reported correlation, ablation, or statistical test between MI scores of partial trajectories and their eventual answer correctness (e.g., in the Experiments section or associated tables). Without this, it remains possible that MI primarily reflects surface visual alignment rather than downstream reasoning viability, rendering the 25% reduction an artifact rather than a general property.
- [Method] The decoupling into visual perception and reasoning stages (with independent MI rewards for the former) assumes failures can be cleanly attributed and optimized separately; however, the paper does not quantify how often high-MI descriptions still lead to incorrect final answers due to subtle hallucinations or reasoning errors, which would undermine the reward signal's reliability.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief explicit statement of how MI is computed (e.g., which estimator or approximation is used) to aid readers unfamiliar with the signal.
- [Experiments] Tables reporting per-benchmark accuracies should include standard deviations or confidence intervals across runs to allow assessment of the stability of the 70.22% average.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the empirical grounding of the MI-based pre-screening and decoupled rewards.
Point-by-point responses
Referee: [Experiments] The central efficiency claim (surpassing 16 full trajectories via 10 pre-samples + top-6 MI selection) is load-bearing for the paper's contribution, yet the manuscript provides no reported correlation, ablation, or statistical test between MI scores of partial trajectories and their eventual answer correctness (e.g., in the Experiments section or associated tables). Without this, it remains possible that MI primarily reflects surface visual alignment rather than downstream reasoning viability, rendering the 25% reduction an artifact rather than a general property.
Authors: We agree that a direct analysis of the relationship between partial-trajectory MI scores and final answer correctness is necessary to substantiate the efficiency claim. In the revised manuscript we will add a dedicated ablation subsection (and associated table) that reports (i) the Pearson correlation between MI of the 10 pre-samples and eventual correctness on the full trajectory, (ii) success-rate curves when trajectories are binned by MI quartile, and (iii) statistical significance tests (paired t-tests with p-values) comparing final accuracy for high-MI versus low-MI selections. This analysis will be performed on the same six benchmarks and will explicitly test whether MI predicts downstream reasoning viability beyond surface-level visual alignment. revision: yes
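A minimal sketch of analyses (i) and (ii), assuming arrays mi (pre-sample MI scores) and correct (binary final-answer correctness) gathered from the benchmark runs; the names and quartile binning are illustrative, not from the manuscript:

```python
import numpy as np
from scipy import stats

def mi_vs_correctness(mi: np.ndarray, correct: np.ndarray):
    """mi: MI score per pre-sampled trajectory; correct: 1 if the completed
    trajectory answered correctly, else 0 (hypothetical collected arrays)."""
    # (i) Pearson (point-biserial) correlation between MI and correctness.
    r, p = stats.pearsonr(mi, correct)
    # (ii) Success rate within each MI quartile, lowest to highest.
    edges = np.quantile(mi, [0.25, 0.5, 0.75])
    bins = np.digitize(mi, edges)                  # bin index 0..3
    rates = [correct[bins == b].mean() for b in range(4)]
    return r, p, rates
```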
Referee: [Method] The decoupling into visual perception and reasoning stages (with independent MI rewards for the former) assumes failures can be cleanly attributed and optimized separately; however, the paper does not quantify how often high-MI descriptions still lead to incorrect final answers due to subtle hallucinations or reasoning errors, which would undermine the reward signal's reliability.
Authors: We acknowledge that quantifying the residual failure modes of high-MI trajectories is important for validating the decoupled reward design. In the revision we will add an error-breakdown analysis that measures, for trajectories with above-median MI, the percentage of final-answer errors attributable to (a) visual-perception hallucinations versus (b) reasoning or subtle hallucination errors downstream of the description. This will be reported both as aggregate percentages across benchmarks and via representative qualitative examples, thereby providing concrete evidence on the reliability of the MI-based visual reward. revision: yes
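A minimal sketch of that breakdown, assuming each failed above-median-MI trajectory carries hypothetical boolean labels perception_error and reasoning_error from manual or automated tagging:

```python
# Each record describes one failed trajectory whose MI was above the median.
# The flags are hypothetical labels from manual or automated error tagging.
def error_breakdown(failures: list[dict]) -> dict:
    n = len(failures)
    perception = sum(f["perception_error"] for f in failures)
    reasoning = sum(f["reasoning_error"] for f in failures)
    return {
        "perception_share": perception / n,  # (a) hallucinated descriptions
        "reasoning_share": reasoning / n,    # (b) errors downstream of them
    }
```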
Circularity Check
No significant circularity; the derivation relies on external MI computation and benchmark experiments.
Full rationale
The paper defines MIRL using mutual information computed directly between generated descriptions and visual inputs as an independent pre-screening signal, with forking and decoupled rewards derived from this calculation rather than from the final accuracy metric. Efficiency claims are supported by explicit experimental comparisons (10 pre-samples + top-6 vs. 16 full trajectories) on six external benchmarks, not by construction or parameter fitting that renames inputs as outputs. No self-definitional equations, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the derivation chain. The method remains self-contained against verifiable rewards and benchmark results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: mutual information can be computed efficiently between text descriptions and visual features as a proxy for trajectory quality.